Answers to Exercises: Let’s practice more GLMs

Alexandre Courtiol

2018-05-03

Disclaimer

In this vignette I illustrate the key steps required to solve the exercises. By no means am I trying to provide a detailed report of the analysis of each dataset. I am instead trying to emphasise the different things you could do in the different exercises. Any final real report would need additional analyses, and the results would have to be written in good English, not in telegraphic style or lines of code. I have also written this vignette very quickly, so double check that I did not mess anything up…

Dataset: InsectSprays

Goal

Note: I will not check the assumptions for LM, nor detail the dataset (and plotting variables) since we did it before.

Fitting the models

We fit the LM:

mod_insect <- lm(count ~ spray, data = InsectSprays)

We fit the LM on the BoxCox transformed response:

InsectSprays$count_bis <- InsectSprays$count + 1
mod_insect_bis <- update(mod_insect, count_bis ~ .)
bc <- car::powerTransform(mod_insect_bis)
InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda)
mod_insect_bc <- update(mod_insect_bis, count_bc ~ .)

We fit the GLM:

mod_insect_glm <- glm(count ~ spray, data = InsectSprays, family = poisson())

Checking the assumptions for GLM

All assumptions about model structure have been checked before (see exercises for LM).

We still have to test for the lack of serial auto-correlation and for independence:

plot(residuals(mod_insect_glm, type = "pearson"))

library(DHARMa)
insect_glm_resid <- simulateResiduals(mod_insect_glm, refit = TRUE, n = 1000)
plot(insect_glm_resid)
## DHARMa::plotResiduals - low number of unique predictor values, consider setting asFactor = T
## DHARMa::plotResiduals - low number of unique predictor values, consider setting asFactor = T

testUniformity(insect_glm_resid)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.094556, p-value = 0.5404
## alternative hypothesis: two-sided
plot(insect_glm_resid, asFactor = TRUE)
## Warning in plotResiduals(pred = x, residuals = NULL, xlab = "Predicted value (rank transformed)", : DHARMa::plotResiduals - predictor is a factor, rank = T has no effect

testTemporalAutocorrelation(insect_glm_resid, time = insect_glm_resid$fitted)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.8008, p-value = 0.1975
## alternative hypothesis: true autocorrelation is greater than 0

All looks good. We also have to test the goodness of fit:

testOverdispersion(insect_glm_resid, alternative = "both")
## 
##  DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
## 
## data:  insect_glm_resid
## dispersion = 1.5104, p-value = 0.012
## alternative hypothesis: both

There seems to be slight overdispersion :-(

Solving the overdispersion issue

Is the overdispersion caused by too many zeros?

table(mod_insect_glm$model$count)
## 
##  0  1  2  3  4  5  6  7  9 10 11 12 13 14 15 16 17 19 20 21 22 23 24 26 
##  2  6  4  8  4  7  3  3  1  3  3  2  4  4  2  2  4  1  2  2  1  1  1  2
testZeroInflation(insect_glm_resid)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
## 
## data:  insect_glm_resid
## ratioObsExp = 1.0521, p-value = 0.293
## alternative hypothesis: more

nope…

Let’s try to fit alternative models to account for the overdispersion:

mod_insect_glm_quasi <- glm(count ~ spray, data = InsectSprays, family = quasipoisson())
library(MASS)
mod_insect_glm_nb <- glm.nb(count ~ spray, data = InsectSprays)
rbind(deviance(mod_insect_glm),
      deviance(mod_insect_glm_quasi),
      deviance(mod_insect_glm_nb))
##          [,1]
## [1,] 98.32866
## [2,] 98.32866
## [3,] 74.14497

Note that the predictions are the same:

plot(predict(mod_insect_glm, type = "response"), mod_insect_glm$model$count)
abline(0, 1, col = "red")

plot(predict(mod_insect_glm_quasi, type = "response"), mod_insect_glm_quasi$model$count)
abline(0, 1, col = "red")

plot(predict(mod_insect_glm_nb, type = "response"), mod_insect_glm_nb$model$count)
abline(0, 1, col = "red")

The only thing that differs is the variance around those predictions.

Note also that the Gaussian models have their deviance expressed on another scale, which makes the comparison between all the models particularly difficult.

pred_bc <- ((predict(mod_insect_bc) * bc$lambda) + 1)^(1/bc$lambda) - 1
(deviance_LM_bc <- sum((pred_bc - InsectSprays$count)^2))
## [1] 1028.846
deviance(mod_insect)
## [1] 1015.167

We can compare the Poisson model with the negative binomial one very simply:

library(lmtest)
lrtest(mod_insect_glm_nb, mod_insect_glm)
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of class "negbin", updated model is of class "glm"
## Likelihood ratio test
## 
## Model 1: count ~ spray
## Model 2: count ~ spray
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   7 -180.11                       
## 2   6 -182.29 -1 4.3713    0.03655 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us stick to the negative binomial model as our best GLM!
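The same conclusion can be reached by comparing AIC values; a quick sketch (note that the quasi-Poisson model has no genuine likelihood, so it cannot enter this comparison):

AIC(mod_insect_glm, mod_insect_glm_nb) ## the negative binomial fit should show the lower AIC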

Retesting assumptions on the negative binomial model

insect_glm_nb_resid   <- simulateResiduals(mod_insect_glm_nb, refit = TRUE, n = 1000)
testUniformity(insect_glm_nb_resid)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.076, p-value = 0.7999
## alternative hypothesis: two-sided
testTemporalAutocorrelation(insect_glm_nb_resid)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.1796, p-value = 0.7783
## alternative hypothesis: true autocorrelation is greater than 0
testOverdispersion(insect_glm_nb_resid, alternative = "both")
## 
##  DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
## 
## data:  insect_glm_nb_resid
## dispersion = 1.0673, p-value = 0.408
## alternative hypothesis: both
testZeroInflation(insect_glm_nb_resid)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
## 
## data:  insect_glm_nb_resid
## ratioObsExp = 0.89646, p-value = 0.383
## alternative hypothesis: more

Everything looks very good!

Tests comparing the fitted models to their respective null models

anova(mod_insect, update(mod_insect, . ~ 1))
## Analysis of Variance Table
## 
## Model 1: count ~ spray
## Model 2: count ~ 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     66 1015.2                                  
## 2     71 3684.0 -5   -2668.8 34.702 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod_insect_bc, update(mod_insect_bc, . ~ 1))
## Analysis of Variance Table
## 
## Model 1: count_bc ~ spray
## Model 2: count_bc ~ 1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     66  31.998                                  
## 2     71 150.337 -5   -118.34 48.818 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod_insect_glm, update(mod_insect_glm, . ~ 1), test = "LR")
## Analysis of Deviance Table
## 
## Model 1: count ~ spray
## Model 2: count ~ 1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        66      98.33                          
## 2        71     409.04 -5  -310.71 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod_insect_glm_nb, update(mod_insect_glm_nb, . ~ 1))
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: count
##   Model     theta Resid. df    2 x log-lik.   Test    df LR stat. Pr(Chi)
## 1     1  1.736021        71       -467.9604                              
## 2 spray 28.099507        66       -360.2179 1 vs 2     5 107.7425       0

No matter the model, the effect of the insecticide is detected!

Predictions

We now compare the predictions and CI between the different models:

data.for.pred <- data.frame(spray = levels(InsectSprays$spray))
pred_lm <- predict(mod_insect, newdata = data.for.pred, interval = "confidence")
pred_bc <- predict(mod_insect_bc, newdata = data.for.pred, interval = "confidence")
unboxcox <- function(x, lambda) (x*lambda + 1)^(1/lambda)
pred_bc_unboxcox <- unboxcox(x = pred_bc, lambda = bc$lambda) - 1  ## we subtract one since we had added one to be able to do the BoxCox
pred_glm_eta <- predict(mod_insect_glm, newdata = data.for.pred, se.fit = TRUE)
pred_glm_table <- data.frame(fit = exp(pred_glm_eta$fit))
pred_glm_table$lwr <- exp(pred_glm_eta$fit + qnorm(0.025)*pred_glm_eta$se.fit)
pred_glm_table$upr <- exp(pred_glm_eta$fit + qnorm(0.975)*pred_glm_eta$se.fit)
pred_glm_nb_eta <- predict(mod_insect_glm_nb, newdata = data.for.pred, se.fit = TRUE)
pred_glm_nb_table <- data.frame(fit = spaMM::negbin()$linkinv(pred_glm_nb_eta$fit))
pred_glm_nb_table$lwr <- spaMM::negbin()$linkinv(pred_glm_nb_eta$fit + qnorm(0.025)*pred_glm_nb_eta$se.fit)
pred_glm_nb_table$upr <- spaMM::negbin()$linkinv(pred_glm_nb_eta$fit + qnorm(0.975)*pred_glm_nb_eta$se.fit)

Observe carefully the results:

Plot

plot(InsectSprays$count ~  as.numeric(InsectSprays$spray), xlim = c(0.5, 6.5), ylim = c(0, 30),
     ylab = "Predicted number of insects", xlab = "Sprays", axes = FALSE)
axis(2, las = 2)
axis(1, at = 1:6, labels = levels(InsectSprays$spray))
box()

points(pred_lm[, "fit"] ~  I(-0.2 + 1:6), col = "blue", pch = 20)

arrows(x0 = -0.2 + 1:6, x1 = -0.2 + 1:6, y0 = pred_lm[, "lwr"],
       y1 = pred_lm[, "upr"], code = 3, col = "blue", angle = 90, length = 0.1)

points(pred_bc_unboxcox[, "fit"] ~  I(-0.1 + 1:6), col = "green", pch = 20)

arrows(x0 = -0.1 + 1:6, x1 = -0.1 + 1:6, y0 = pred_bc_unboxcox[, "lwr"],
       y1 = pred_bc_unboxcox[, "upr"], code = 3, col = "green", angle = 90, length = 0.1)

points(pred_glm_table[, "fit"] ~  I(0.1 + 1:6), col = "red", pch = 20)

arrows(x0 = 0.1 + 1:6, x1 = 0.1 + 1:6, y0 = pred_glm_table[, "lwr"],
       y1 = pred_glm_table[, "upr"], code = 3, col = "red", angle = 90, length = 0.1)

points(pred_glm_nb_table[, "fit"] ~  I(0.2 + 1:6), col = "orange", pch = 20)

arrows(x0 = 0.2 + 1:6, x1 = 0.2 + 1:6, y0 = pred_glm_nb_table[, "lwr"],
       y1 = pred_glm_nb_table[, "upr"], code = 3, col = "orange", angle = 90, length = 0.1)

legend("top", fill = c("blue", "green", "red", "orange"), legend = c("LM", "LM + BoxCox", "GLM Poisson", "GLM Negative Binomial"), bty = "n")

Testing properly whether the BoxCox model is better than the NegBin one

set.seed(1L)
logLik_diff_H0 <- replicate(1000, {
  new.y <- round(((simulate(mod_insect_bc)[, 1] * bc$lambda) + 1)^(1/bc$lambda) - 1)
  new.y_plus <- new.y + abs(min(new.y)) - min(new.y + abs(min(new.y)))
  new.y_plus1 <- new.y_plus + 1
  mod_insect_bis <- lm(new.y_plus1 ~ spray, data = InsectSprays)
  bc_new <- car::powerTransform(mod_insect_bis)
  new.y_plus1_bc <- car::bcPower(new.y_plus1, bc_new$lambda)
  mod_insect_bc_new <- lm(new.y_plus1_bc ~ spray, data = InsectSprays)
  mod_insect_glm_nb_new <- glm.nb(new.y_plus ~ spray, data = InsectSprays)
  logLik(mod_insect_bc_new)[[1]] - logLik(mod_insect_glm_nb_new)[[1]]
  })
(logLik_diff_obs <- logLik(mod_insect_bc)[[1]] - logLik(mod_insect_glm_nb)[[1]])
## [1] 107.1409
(sum(logLik_diff_obs < (logLik_diff_H0)) + 1) / (length(logLik_diff_H0) + 1)
## [1] 0.4075924
hist(logLik_diff_H0, nclass = 10)
abline(v = logLik_diff_obs, col = "red", lwd = 2, lty = 2)

Testing properly whether the NegBin model is better than the BoxCox one

set.seed(1L)
logLik_diff_H0 <- replicate(1000, {
  new.y <- simulate(mod_insect_glm_nb)[, 1]
  new.y_plus <- new.y + abs(min(new.y)) - min(new.y + abs(min(new.y)))
  new.y_plus1 <- new.y_plus + 1
  mod_insect_bis <- lm(new.y_plus1 ~ spray, data = InsectSprays)
  bc_new <- car::powerTransform(mod_insect_bis)
  new.y_plus1_bc <- car::bcPower(new.y_plus1, bc_new$lambda)
  mod_insect_bc_new <- lm(new.y_plus1_bc ~ spray, data = InsectSprays)
  mod_insect_glm_nb_new <- glm.nb(new.y ~ spray, data = InsectSprays)
  logLik(mod_insect_bc_new)[[1]] - logLik(mod_insect_glm_nb_new)[[1]]
  })
(logLik_diff_obs <- logLik(mod_insect_bc)[[1]] - logLik(mod_insect_glm_nb)[[1]])
## [1] 107.1409
(sum(logLik_diff_obs < (logLik_diff_H0)) + 1) / (length(logLik_diff_H0) + 1)
## [1] 0.2237762
hist(logLik_diff_H0, nclass = 10)
abline(v = logLik_diff_obs, col = "red", lwd = 2, lty = 2)

The two models do not differ; we could choose either!

This is not a general result: on other data, a LM on BoxCoxed data and a negative binomial fit may differ greatly!


Dataset: LM2GLMM::Surprise

Goal

Solution

Try figuring out this one on your own! It’s all in the title.


Dataset: esoph

Goal

Exploration of the data

esoph
##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
## 11 25-34    80-119      30+      0         2
## 12 25-34      120+ 0-9g/day      0         1
## 13 25-34      120+    10-19      1         1
## 14 25-34      120+    20-29      0         1
## 15 25-34      120+      30+      0         2
## 16 35-44 0-39g/day 0-9g/day      0        60
## 17 35-44 0-39g/day    10-19      1        14
## 18 35-44 0-39g/day    20-29      0         7
## 19 35-44 0-39g/day      30+      0         8
## 20 35-44     40-79 0-9g/day      0        35
## 21 35-44     40-79    10-19      3        23
## 22 35-44     40-79    20-29      1        14
## 23 35-44     40-79      30+      0         8
## 24 35-44    80-119 0-9g/day      0        11
## 25 35-44    80-119    10-19      0         6
## 26 35-44    80-119    20-29      0         2
## 27 35-44    80-119      30+      0         1
## 28 35-44      120+ 0-9g/day      2         3
## 29 35-44      120+    10-19      0         3
## 30 35-44      120+    20-29      2         4
## 31 45-54 0-39g/day 0-9g/day      1        46
## 32 45-54 0-39g/day    10-19      0        18
## 33 45-54 0-39g/day    20-29      0        10
## 34 45-54 0-39g/day      30+      0         4
## 35 45-54     40-79 0-9g/day      6        38
## 36 45-54     40-79    10-19      4        21
## 37 45-54     40-79    20-29      5        15
## 38 45-54     40-79      30+      5         7
## 39 45-54    80-119 0-9g/day      3        16
## 40 45-54    80-119    10-19      6        14
## 41 45-54    80-119    20-29      1         5
## 42 45-54    80-119      30+      2         4
## 43 45-54      120+ 0-9g/day      4         4
## 44 45-54      120+    10-19      3         4
## 45 45-54      120+    20-29      2         3
## 46 45-54      120+      30+      4         4
## 47 55-64 0-39g/day 0-9g/day      2        49
## 48 55-64 0-39g/day    10-19      3        22
## 49 55-64 0-39g/day    20-29      3        12
## 50 55-64 0-39g/day      30+      4         6
## 51 55-64     40-79 0-9g/day      9        40
## 52 55-64     40-79    10-19      6        21
## 53 55-64     40-79    20-29      4        17
## 54 55-64     40-79      30+      3         6
## 55 55-64    80-119 0-9g/day      9        18
## 56 55-64    80-119    10-19      8        15
## 57 55-64    80-119    20-29      3         6
## 58 55-64    80-119      30+      4         4
## 59 55-64      120+ 0-9g/day      5        10
## 60 55-64      120+    10-19      6         7
## 61 55-64      120+    20-29      2         3
## 62 55-64      120+      30+      5         6
## 63 65-74 0-39g/day 0-9g/day      5        48
## 64 65-74 0-39g/day    10-19      4        14
## 65 65-74 0-39g/day    20-29      2         7
## 66 65-74 0-39g/day      30+      0         2
## 67 65-74     40-79 0-9g/day     17        34
## 68 65-74     40-79    10-19      3        10
## 69 65-74     40-79    20-29      5         9
## 70 65-74    80-119 0-9g/day      6        13
## 71 65-74    80-119    10-19      4        12
## 72 65-74    80-119    20-29      2         3
## 73 65-74    80-119      30+      1         1
## 74 65-74      120+ 0-9g/day      3         4
## 75 65-74      120+    10-19      1         2
## 76 65-74      120+    20-29      1         1
## 77 65-74      120+      30+      1         1
## 78   75+ 0-39g/day 0-9g/day      1        18
## 79   75+ 0-39g/day    10-19      2         6
## 80   75+ 0-39g/day      30+      1         3
## 81   75+     40-79 0-9g/day      2         5
## 82   75+     40-79    10-19      1         3
## 83   75+     40-79    20-29      0         3
## 84   75+     40-79      30+      1         1
## 85   75+    80-119 0-9g/day      1         1
## 86   75+    80-119    10-19      1         1
## 87   75+      120+ 0-9g/day      2         2
## 88   75+      120+    10-19      1         1
str(esoph)
## 'data.frame':    88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...

The factors are ordered, which triggers different default contrasts (polynomial instead of treatment), so we set them as unordered:

esoph$agegp_f <- factor(esoph$agegp, ordered = FALSE)
esoph$alcgp_f <- factor(esoph$alcgp, ordered = FALSE)
esoph$tobgp_f <- factor(esoph$tobgp, ordered = FALSE)
summary(esoph)
##    agegp          alcgp         tobgp        ncases         ncontrols      agegp_f        alcgp_f       tobgp_f  
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00   25-34:15   0-39g/day:23   0-9g/day:24  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00   35-44:15   40-79    :23   10-19   :24  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00   45-54:16   80-119   :21   20-29   :20  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08   55-64:16   120+     :21   30+     :20  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00   65-74:15                               
##  75+  :11                                Max.   :17.000   Max.   :60.00   75+  :11
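To see concretely what the conversion changes, we can display the contrasts attached to each version of the factor (assuming the default options(contrasts) settings):

contrasts(esoph$agegp)   ## polynomial contrasts (.L, .Q, .C, ...)
contrasts(esoph$agegp_f) ## treatment (dummy) contrasts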
esoph$cancer_freq <- esoph$ncases / (esoph$ncases + esoph$ncontrols)
esoph$cancer_N <- esoph$ncases + esoph$ncontrols
summary(esoph$cancer_freq) ## proportion of cancer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2111  0.2089  0.3679  0.5000
table(esoph$alcgp_f, esoph$tobgp_f)
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        6     6     5   6
##   40-79            6     6     6   5
##   80-119           6     6     4   5
##   120+             6     6     5   4
table(esoph$alcgp_f, esoph$tobgp_f, esoph$agegp_f)
## , ,  = 25-34
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     1   1
##   40-79            1     1     1   1
##   80-119           1     1     0   1
##   120+             1     1     1   1
## 
## , ,  = 35-44
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     1   1
##   40-79            1     1     1   1
##   80-119           1     1     1   1
##   120+             1     1     1   0
## 
## , ,  = 45-54
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     1   1
##   40-79            1     1     1   1
##   80-119           1     1     1   1
##   120+             1     1     1   1
## 
## , ,  = 55-64
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     1   1
##   40-79            1     1     1   1
##   80-119           1     1     1   1
##   120+             1     1     1   1
## 
## , ,  = 65-74
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     1   1
##   40-79            1     1     1   0
##   80-119           1     1     1   1
##   120+             1     1     1   1
## 
## , ,  = 75+
## 
##            
##             0-9g/day 10-19 20-29 30+
##   0-39g/day        1     1     0   1
##   40-79            1     1     1   1
##   80-119           1     1     0   0
##   120+             1     1     0   0

We plot the data:

coplot(cancer_freq ~ tobgp_f | agegp_f, data = esoph)

coplot(cancer_freq ~ alcgp_f | agegp_f, data = esoph)

We fit the models:

mod_esoph_logit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                       family = binomial(link = "logit"))
mod_esoph_cauchit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                         family = binomial(link = "cauchit"))
mod_esoph_probit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                        family = binomial(link = "probit"))
AIC(mod_esoph_logit)
## [1] 225.454
AIC(mod_esoph_cauchit)
## [1] 242.4252
AIC(mod_esoph_probit)
## [1] 221.825

The probit model fits the data best.

Comparison to null model

mod_esoph_probit_H0 <- glm(cbind(ncases, ncontrols) ~ 1, data = esoph,
                           family = binomial(link = "probit"))
anova(mod_esoph_probit, mod_esoph_probit_H0, test = "LR")
## Analysis of Deviance Table
## 
## Model 1: cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f
## Model 2: cbind(ncases, ncontrols) ~ 1
##   Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)    
## 1        76     50.344                           
## 2        87    227.241 -11   -176.9 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model is much better than a null model!

Testing the model assumptions

esoph_probit_resid <- simulateResiduals(mod_esoph_probit, refit = TRUE, n = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plot(esoph_probit_resid)

testUniformity(esoph_probit_resid)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.098545, p-value = 0.3599
## alternative hypothesis: two-sided
testTemporalAutocorrelation(esoph_probit_resid, time = 1:nrow(esoph))

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.0787, p-value = 0.6445
## alternative hypothesis: true autocorrelation is greater than 0
testTemporalAutocorrelation(esoph_probit_resid, time = mod_esoph_probit$fitted.values)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.0433, p-value = 0.5807
## alternative hypothesis: true autocorrelation is greater than 0
testZeroInflation(esoph_probit_resid)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
## 
## data:  esoph_probit_resid
## ratioObsExp = 0.96783, p-value = 0.576
## alternative hypothesis: more
testOverdispersion(esoph_probit_resid, alternative = "both")
## 
##  DHARMa nonparametric overdispersion test via comparison to simulation under H0 = fitted model
## 
## data:  esoph_probit_resid
## dispersion = 0.74473, p-value = 0.17
## alternative hypothesis: both

Testing the model effects

Anova(mod_esoph_probit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: cbind(ncases, ncontrols)
##         LR Chisq Df Pr(>Chisq)    
## agegp_f   80.109  5  7.961e-16 ***
## alcgp_f   69.533  3  5.373e-15 ***
## tobgp_f   11.593  3   0.008915 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plotting the effects

library(effects)
plot(allEffects(mod_esoph_probit))

plot(allEffects(mod_esoph_probit, transformation = list(link = NULL, inverse = NULL)))

library(visreg)
visreg(mod_esoph_probit, scale = "response")

visreg2d(mod_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "image", scale = "response")

visreg2d(mod_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "persp", scale = "response")

Note: the two packages differ in how they compute predictions. The package effects seems, here, to compute the average effect of the non-focal factors by weighting their levels by sample size. The package visreg seems, here, to fix each non-focal factor at its most frequently observed level. I am not sure what it does when sample sizes are tied… Doing the predictions yourself is perhaps the easiest way to know what is actually happening!
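For example, here is a minimal sketch of how one could mimic what visreg seems to do (the helper most_common() is mine, and ties are implicitly broken by taking the first level reaching the maximal count):

most_common <- function(f) names(which.max(table(f))) ## most frequent level of a factor
grid <- data.frame(agegp_f = most_common(esoph$agegp_f),
                   alcgp_f = levels(esoph$alcgp_f),
                   tobgp_f = most_common(esoph$tobgp_f))
cbind(grid, pred = predict(mod_esoph_probit, newdata = grid, type = "response"))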

Predictions by hand

Let us compare extreme situations:

newdata_esoph_young <- expand.grid(agegp_f = c("25-34"),
                                   alcgp_f = c("0-39g/day", "120+"),
                                   tobgp_f = c("0-9g/day", "30+"))
pred_esoph_young <- predict(mod_esoph_probit, newdata = newdata_esoph_young, type = "link", se.fit = TRUE)
newdata_esoph_young$pred <- binomial(link = "probit")$linkinv(pred_esoph_young$fit)
newdata_esoph_young$lwr <- binomial(link = "probit")$linkinv(pred_esoph_young$fit + qnorm(0.025)*pred_esoph_young$se.fit)
newdata_esoph_young$upr <- binomial(link = "probit")$linkinv(pred_esoph_young$fit + qnorm(0.975)*pred_esoph_young$se.fit)
newdata_esoph_young
##   agegp_f   alcgp_f  tobgp_f         pred          lwr        upr
## 1   25-34 0-39g/day 0-9g/day 0.0007619471 0.0000273923 0.01055611
## 2   25-34      120+ 0-9g/day 0.0262383385 0.0024738469 0.14276353
## 3   25-34 0-39g/day      30+ 0.0039002480 0.0002025362 0.03717823
## 4   25-34      120+      30+ 0.0764025052 0.0106660728 0.28864986
newdata_esoph_old <- expand.grid(agegp_f = c("75+"),
                                 alcgp_f = c("0-39g/day", "120+"),
                                 tobgp_f = c("0-9g/day", "30+"))
pred_esoph_old <- predict(mod_esoph_probit, newdata = newdata_esoph_old, type = "link", se.fit = TRUE)
newdata_esoph_old$pred <- binomial(link = "probit")$linkinv(pred_esoph_old$fit)
newdata_esoph_old$lwr <- binomial(link = "probit")$linkinv(pred_esoph_old$fit + qnorm(0.025)*pred_esoph_old$se.fit)
newdata_esoph_old$upr <- binomial(link = "probit")$linkinv(pred_esoph_old$fit + qnorm(0.975)*pred_esoph_old$se.fit)
newdata_esoph_old
##   agegp_f   alcgp_f  tobgp_f       pred       lwr       upr
## 1     75+ 0-39g/day 0-9g/day 0.09511196 0.0427947 0.1838748
## 2     75+      120+ 0-9g/day 0.46850630 0.2953570 0.6479350
## 3     75+ 0-39g/day      30+ 0.21173725 0.1003285 0.3740544
## 4     75+      120+      30+ 0.66657817 0.4708055 0.8249030

So alcohol seems to be a bit worse than smoking, but we are extrapolating a bit since very few individuals correspond to the predicted conditions! (see the table() call above)

One thing is clear: not smoking and not drinking seems to be the way to avoid this cancer.

Trying to fit the interaction to answer our question

esoph$drink_smoke <- factor(paste(esoph$alcgp_f, esoph$tobgp_f, sep = "_"))
mod_esoph_probit2 <- glm(cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, data = esoph,
                        family = binomial(link = "probit"))
Anova(mod_esoph_probit2)
## Analysis of Deviance Table (Type II tests)
## 
## Response: cbind(ncases, ncontrols)
##             LR Chisq Df Pr(>Chisq)    
## agegp_f       79.061  5  1.319e-15 ***
## drink_smoke   94.224 15  1.606e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_esoph_probit2)
## 
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, 
##     family = binomial(link = "probit"), data = esoph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8509  -0.5300  -0.2056   0.2616   1.9162  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -3.3734     0.4531  -7.446 9.62e-14 ***
## agegp_f35-44                 0.7288     0.4542   1.604 0.108615    
## agegp_f45-54                 1.4536     0.4333   3.355 0.000794 ***
## agegp_f55-64                 1.6772     0.4305   3.896 9.78e-05 ***
## agegp_f65-74                 1.9048     0.4355   4.373 1.22e-05 ***
## agegp_f75+                   1.8491     0.4675   3.955 7.64e-05 ***
## drink_smoke0-39g/day_10-19   0.5839     0.2418   2.415 0.015737 *  
## drink_smoke0-39g/day_20-29   0.5995     0.3065   1.956 0.050454 .  
## drink_smoke0-39g/day_30+     0.9204     0.3302   2.787 0.005320 ** 
## drink_smoke120+_0-9g/day     1.5668     0.2609   6.006 1.90e-09 ***
## drink_smoke120+_10-19        1.6410     0.2891   5.676 1.38e-08 ***
## drink_smoke120+_20-29        1.7381     0.3495   4.973 6.58e-07 ***
## drink_smoke120+_30+          1.6849     0.3173   5.311 1.09e-07 ***
## drink_smoke40-79_0-9g/day    0.8609     0.1939   4.441 8.95e-06 ***
## drink_smoke40-79_10-19       0.9881     0.2234   4.423 9.71e-06 ***
## drink_smoke40-79_20-29       1.0136     0.2341   4.329 1.50e-05 ***
## drink_smoke40-79_30+         1.3801     0.2923   4.721 2.34e-06 ***
## drink_smoke80-119_0-9g/day   1.0703     0.2257   4.743 2.11e-06 ***
## drink_smoke80-119_10-19      1.1832     0.2303   5.137 2.80e-07 ***
## drink_smoke80-119_20-29      1.1478     0.3330   3.447 0.000566 ***
## drink_smoke80-119_30+        1.5511     0.3470   4.470 7.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 227.241  on 87  degrees of freedom
## Residual deviance:  44.889  on 67  degrees of freedom
## AIC: 234.37
## 
## Number of Fisher Scoring iterations: 6

Ideally we should again check all assumptions (you should do it!).

Let us do the minimum here:

plot(simulateResiduals(mod_esoph_probit2, refit = TRUE, n = 1000))

Let us now compare the alcohol and smoking effects.

library(multcomp)
table(esoph$drink_smoke)
## 
## 0-39g/day_0-9g/day    0-39g/day_10-19    0-39g/day_20-29      0-39g/day_30+      120+_0-9g/day         120+_10-19         120+_20-29           120+_30+     40-79_0-9g/day        40-79_10-19 
##                  6                  6                  5                  6                  6                  6                  5                  4                  6                  6 
##        40-79_20-29          40-79_30+    80-119_0-9g/day       80-119_10-19       80-119_20-29         80-119_30+ 
##                  6                  5                  6                  6                  4                  5
contr_base <- matrix(c(1, rep(0, 15)), nrow = 1) ## define the design matrix for the posthoc test
contr <- contr_base
contr[levels(esoph$drink_smoke) == "120+_0-9g/day"] <- 1
contr[levels(esoph$drink_smoke) == "0-39g/day_30+"] <- -1
colnames(contr) <- levels(esoph$drink_smoke)
rownames(contr) <- "alcool - smoke"
contr
##                0-39g/day_0-9g/day 0-39g/day_10-19 0-39g/day_20-29 0-39g/day_30+ 120+_0-9g/day 120+_10-19 120+_20-29 120+_30+ 40-79_0-9g/day 40-79_10-19 40-79_20-29 40-79_30+ 80-119_0-9g/day
## alcool - smoke                  1               0               0            -1             1          0          0        0              0           0           0         0               0
##                80-119_10-19 80-119_20-29 80-119_30+
## alcool - smoke            0            0          0
posthoc <- glht(mod_esoph_probit2, linfct = mcp(drink_smoke = contr))
par(oma = c(1, 5, 1, 1))
plot(posthoc)

confint(posthoc)
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: glm(formula = cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, 
##     family = binomial(link = "probit"), data = esoph)
## 
## Quantile = 1.96
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                     Estimate lwr      upr     
## alcool - smoke == 0  0.64643 -0.04785  1.34071
summary(posthoc, test = Chisqtest()) ## choose the test best suited to your case
## 
##   General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Linear Hypotheses:
##                     Estimate
## alcool - smoke == 0   0.6464
## 
## Global Test:
##   Chisq DF Pr(>Chisq)
## 1  3.33  1    0.06802

We have compared two different situations, using a posthoc approach.
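If instead of a single custom contrast you wanted all pairwise comparisons between the combined drinking/smoking levels, multcomp can build the full Tukey set directly (a sketch; with 16 levels this produces 120 comparisons, so the multiplicity correction is severe):

```r
## all pairwise comparisons, with family-wise error control (a sketch)
posthoc_all <- glht(mod_esoph_probit2, linfct = mcp(drink_smoke = "Tukey"))
summary(posthoc_all)
```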

Alternative: effects considered as linear

esoph$alc_num <- unclass(esoph$alcgp)
esoph$tob_num <- unclass(esoph$tobgp)
mod_esoph_probit3 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alc_num + tob_num, data = esoph,
                         family = binomial(link = "probit"))
plot(simulateResiduals(mod_esoph_probit3, refit = TRUE, n = 1000))

crPlots(mod_esoph_probit3)

AIC(mod_esoph_probit)
## [1] 221.825
AIC(mod_esoph_probit2)
## [1] 234.3696
AIC(mod_esoph_probit3)
## [1] 218.8386
confint(mod_esoph_probit3)
## Waiting for profiling to be done...
##                    2.5 %     97.5 %
## (Intercept)  -4.66120665 -2.8095856
## agegp_f35-44 -0.03856800  1.8280202
## agegp_f45-54  0.71465265  2.4983633
## agegp_f55-64  0.94248804  2.7131850
## agegp_f65-74  1.14747222  2.9354270
## agegp_f75+    1.02421193  2.9346507
## alc_num       0.29182068  0.4810442
## tob_num       0.06227855  0.2460660

This last model seems quite good and it allows us to answer our question!

More?


Dataset: carData::TitanicSurvival

Goal

Exploration of the data

data(TitanicSurvival, package = "carData")
head(TitanicSurvival)
##                                 survived    sex     age passengerClass
## Allen, Miss. Elisabeth Walton        yes female 29.0000            1st
## Allison, Master. Hudson Trevor       yes   male  0.9167            1st
## Allison, Miss. Helen Loraine          no female  2.0000            1st
## Allison, Mr. Hudson Joshua Crei       no   male 30.0000            1st
## Allison, Mrs. Hudson J C (Bessi       no female 25.0000            1st
## Anderson, Mr. Harry                  yes   male 48.0000            1st
str(TitanicSurvival)
## 'data.frame':    1309 obs. of  4 variables:
##  $ survived      : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
##  $ sex           : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age           : num  29 0.917 2 30 25 ...
##  $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
summary(TitanicSurvival)
##  survived      sex           age          passengerClass
##  no :809   female:466   Min.   : 0.1667   1st:323       
##  yes:500   male  :843   1st Qu.:21.0000   2nd:277       
##                         Median :28.0000   3rd:709       
##                         Mean   :29.8811                 
##                         3rd Qu.:39.0000                 
##                         Max.   :80.0000                 
##                         NA's   :263
with(data = TitanicSurvival,
     tapply(survived, sex, function(x) mean(x == "yes")))
##    female      male 
## 0.7274678 0.1909846
with(data = TitanicSurvival,
     tapply(survived, passengerClass, function(x) mean(x == "yes")))
##       1st       2nd       3rd 
## 0.6191950 0.4296029 0.2552891
with(data = TitanicSurvival, table(sex, passengerClass))
##         passengerClass
## sex      1st 2nd 3rd
##   female 144 106 216
##   male   179 171 493

Visualization of the data

hist(TitanicSurvival$age)

library(car)
scatterplot(survived == "yes" ~ age, data = TitanicSurvival)

The effect of age does not seem linear…

We recode some variables

TitanicSurvival$surv <-  TitanicSurvival$survived == "yes"
TitanicSurvival$age_f <- cut(TitanicSurvival$age, breaks = c(0, 10, 25, 35, 50, 81), include.lowest = TRUE)
table(TitanicSurvival$age_f)
## 
##  [0,10] (10,25] (25,35] (35,50] (50,81] 
##      86     357     281     227      95

Fitting the models

We fit a series of models with different links, considering age either as a factor or as continuous:

mod_logit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "logit"))
mod_probit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "probit"))
mod_cauchit <- glm(surv ~ age_f+passengerClass+sex, data = TitanicSurvival, family = binomial(link = "cauchit"))

mod_logit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
                 family = binomial(link = "logit"))
mod_probit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
                  family = binomial(link = "probit"))
mod_cauchit2 <- glm(surv ~ age+passengerClass+sex, data = TitanicSurvival,
                   family = binomial(link = "cauchit"))

rbind(AIC(mod_logit),
      AIC(mod_probit),
      AIC(mod_cauchit),
      AIC(mod_logit2),
      AIC(mod_probit2),
      AIC(mod_cauchit2))
##          [,1]
## [1,] 995.0867
## [2,] 997.4459
## [3,] 978.3824
## [4,] 992.4531
## [5,] 994.5528
## [6,] 981.3796

The cauchit model with age as a factor provides a much better fit than other models.

Checking the assumptions

library(DHARMa)
resid_Titanic <- simulateResiduals(mod_cauchit, refit = TRUE, n = 1000)
plot(resid_Titanic)

testTemporalAutocorrelation(resid_Titanic, time = mod_cauchit$fitted.values)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.9828, p-value = 0.3902
## alternative hypothesis: true autocorrelation is greater than 0
testTemporalAutocorrelation(resid_Titanic, time = 1:nrow(mod_cauchit$model))

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.9914, p-value = 0.4446
## alternative hypothesis: true autocorrelation is greater than 0

No need to test for overdispersion in a binary model!
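The reason is that for binary (0/1) responses the Bernoulli variance p(1 - p) is fully determined by the mean, so overdispersion cannot be identified. If you compute the usual Pearson dispersion statistic anyway (a sketch, using the fitted mod_cauchit), keep in mind that its value is not a reliable overdispersion diagnostic for binary data:

```r
## Pearson dispersion statistic: sum of squared Pearson residuals / residual df.
## Informative for counts or binomial totals, but not for binary responses.
sum(residuals(mod_cauchit, type = "pearson")^2) / df.residual(mod_cauchit)
```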

We look at predictions

data.for.pred <- with(data = TitanicSurvival,
                 expand.grid(age_f = levels(age_f), passengerClass = levels(passengerClass), sex = levels(sex)))
pred <- predict(mod_cauchit, newdata = data.for.pred, type = "link")
data.to.plot <- cbind(data.for.pred, pred = pred)
coplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)

library(lattice) ## same, more fancy
xyplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)

GAM?

Another way to cope with the non-linearity of the age effect would be to fit a GAM:

library(mgcv)
mod_cauchit_gam <- gam(surv ~ s(age) + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "cauchit"))
AIC(mod_cauchit_gam)
## [1] 972.3109

This model is a bit better, so we could use it.

GAM model assumptions:

cauchit_gam_resid <- simulateResiduals(mod_cauchit_gam)
plot(cauchit_gam_resid)

scatter.smooth(mod_cauchit_gam$model$age, cauchit_gam_resid$scaledResiduals, col = 2) ## linearity

Tests

Anova(mod_cauchit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: surv
##                LR Chisq Df Pr(>Chisq)    
## age_f            37.089  4  1.727e-07 ***
## passengerClass  124.906  2  < 2.2e-16 ***
## sex             292.157  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod_cauchit_gam)
## 
## Family: binomial 
## Link function: cauchit 
## 
## Formula:
## surv ~ s(age) + passengerClass + sex
## 
## Parametric Terms:
##                df Chi.sq  p-value
## passengerClass  2  67.04 2.77e-15
## sex             1  87.49  < 2e-16
## 
## Approximate significance of smooth terms:
##          edf Ref.df Chi.sq  p-value
## s(age) 7.488  8.327  38.67 7.33e-06

Predictions

data.for.pred2 <- with(data = TitanicSurvival,
                      expand.grid(age_f = levels(TitanicSurvival$age_f), passengerClass = levels(passengerClass), sex = levels(sex)))
data.for.pred2$age <- NA
data.for.pred2$age[data.for.pred2$age_f == "[0,10]"] <- (10 - 0)/2
data.for.pred2$age[data.for.pred2$age_f == "(10,25]"] <- (25 + 10)/2
data.for.pred2$age[data.for.pred2$age_f == "(25,35]"] <- (25 + 35)/2
data.for.pred2$age[data.for.pred2$age_f == "(35,50]"] <- (35 + 50)/2
data.for.pred2$age[data.for.pred2$age_f == "(50,81]"] <- (50 + 81)/2

pred_age_f <- predict(mod_cauchit, newdata = data.for.pred2, type = "response")
pred_age_gam <- predict(mod_cauchit_gam, newdata = data.for.pred2, type = "response")
data.to.plot2 <- cbind(data.for.pred2, pred_age_f = pred_age_f, pred_age_gam = pred_age_gam)

pred_f_1 <- subset(data.to.plot2, sex == "female" & passengerClass == "1st")
pred_f_2 <- subset(data.to.plot2, sex == "female" & passengerClass == "2nd")
pred_f_3 <- subset(data.to.plot2, sex == "female" & passengerClass == "3rd")

pred_m_1 <- subset(data.to.plot2, sex == "male" & passengerClass == "1st")
pred_m_2 <- subset(data.to.plot2, sex == "male" & passengerClass == "2nd")
pred_m_3 <- subset(data.to.plot2, sex == "male" & passengerClass == "3rd")
par(mfrow = c(1, 2))
plot(pred_f_1$pred_age_f ~ pred_f_1$age, type = "o", lwd = 2, ylim = c(0, 1.4),
    ylab = "Prob. Survival", xlab = "Age", main = "Age as factor")
points(pred_f_2$pred_age_f ~ pred_f_2$age, type = "o", lwd = 2, col = 2)
points(pred_f_3$pred_age_f ~ pred_f_3$age, type = "o", lwd = 2, col = 3)

points(pred_m_1$pred_age_f ~ pred_m_1$age, type = "o", lwd = 2, col = 1, lty = 2)
points(pred_m_2$pred_age_f ~ pred_m_2$age, type = "o", lwd = 2, col = 2, lty = 2)
points(pred_m_3$pred_age_f ~ pred_m_3$age, type = "o", lwd = 2, col = 3, lty = 2)

legend("top", lty = c(1, 1, 1, 2, 2, 2), col = c(1:3, 1:3),
       legend = c("f_1", "f_2", "f_3", "m_1", "m_2", "m_3"),
       lwd = rep(2, 6), cex = 0.6)

plot(pred_f_1$pred_age_gam ~ pred_f_1$age, type = "o", ylim = c(0, 1.4),
    ylab = "Prob. Survival", xlab = "Age", main = "Age as GAM")
points(pred_f_2$pred_age_gam ~ pred_f_2$age, type = "o", col = 2)
points(pred_f_3$pred_age_gam ~ pred_f_3$age, type = "o", col = 3)

points(pred_m_1$pred_age_gam ~ pred_m_1$age, type = "o", col = 1, lty = 2)
points(pred_m_2$pred_age_gam ~ pred_m_2$age, type = "o", col = 2, lty = 2)
points(pred_m_3$pred_age_gam ~ pred_m_3$age, type = "o", col = 3, lty = 2)

legend("top", lty = c(1, 1, 1, 2, 2, 2), col = c(1:3, 1:3),
       legend = c("f_1", "f_2", "f_3", "m_1", "m_2", "m_3"),
       lwd = rep(1, 6), cex = 0.6)

Odds-ratio

As the model is not fitted with the logit link, the odds ratios are not constant with age…

p1 <- pred_f_1[pred_f_1$age == 5, "pred_age_f"]
p2 <- pred_m_3[pred_m_3$age == 5, "pred_age_f"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 41.30262
p1 <- pred_f_1[pred_f_1$age == 65.5, "pred_age_f"]
p2 <- pred_m_3[pred_m_3$age == 65.5, "pred_age_f"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 80.77962
p1 <- pred_f_1[pred_f_1$age == 5, "pred_age_gam"]
p2 <- pred_m_3[pred_m_3$age == 5, "pred_age_gam"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 71.77181
p1 <- pred_f_1[pred_f_1$age == 65.5, "pred_age_gam"]
p2 <- pred_m_3[pred_m_3$age == 65.5, "pred_age_gam"]
(p1/(1 - p1)) / (p2/(1 - p2))
## [1] 47.88521

Note: the differences between the two models occur at the extremes, for which we have few observations.

with(subset(TitanicSurvival, age_f == "[0,10]"), table(passengerClass, sex))
##               sex
## passengerClass female male
##            1st      1    3
##            2nd     11   11
##            3rd     29   31
with(subset(TitanicSurvival, age_f == "(50,81]"), table(passengerClass, sex))
##               sex
## passengerClass female male
##            1st     27   37
##            2nd      4   16
##            3rd      1   10


Dataset: LM2GLMM::Challenger

Goal

Let’s explore the data

str(Challenger)
## 'data.frame':    23 obs. of  5 variables:
##  $ oring_tot: int  6 6 6 6 6 6 6 6 6 6 ...
##  $ oring_dt : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ temp     : int  66 70 69 68 67 72 73 70 57 63 ...
##  $ psi      : int  50 50 50 50 50 50 100 100 200 200 ...
##  $ flight   : int  1 2 3 4 5 6 7 8 9 10 ...
summary(Challenger)
##    oring_tot    oring_dt           temp            psi            flight    
##  Min.   :6   Min.   :0.0000   Min.   :53.00   Min.   : 50.0   Min.   : 1.0  
##  1st Qu.:6   1st Qu.:0.0000   1st Qu.:67.00   1st Qu.: 75.0   1st Qu.: 6.5  
##  Median :6   Median :0.0000   Median :70.00   Median :200.0   Median :12.0  
##  Mean   :6   Mean   :0.3043   Mean   :69.57   Mean   :152.2   Mean   :12.0  
##  3rd Qu.:6   3rd Qu.:0.5000   3rd Qu.:75.00   3rd Qu.:200.0   3rd Qu.:17.5  
##  Max.   :6   Max.   :2.0000   Max.   :81.00   Max.   :200.0   Max.   :23.0
table(nb_orings = Challenger$oring_tot, nb_pb_orings = Challenger$oring_dt)
##          nb_pb_orings
## nb_orings  0  1  2
##         6 17  5  1

Let’s visualize the data

plot(jitter(oring_dt, factor = 0.3) ~ temp, data = Challenger)

Notice the use of jitter!

We recode the data

Challenger$oring_fine <- Challenger$oring_tot - Challenger$oring_dt

We fit the models

mod_binom_logit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                       family = binomial(link = "logit"))
mod_binom_probit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                        family = binomial(link = "probit"))
mod_binom_cauchit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                         family = binomial(link = "cauchit"))
AIC(mod_binom_logit)
## [1] 24.8652
AIC(mod_binom_probit)
## [1] 24.76826
AIC(mod_binom_cauchit)
## [1] 26.63011

We could choose the logit or the probit model. The estimates would be easier to interpret with the logit, but we do not need that here, so let us consider the probit model since it fits best.

Let’s check assumptions

r_probit <- simulateResiduals(mod_binom_probit, n = 1000)
plot(r_probit)

testUniformity(r_probit)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.26857, p-value = 0.05908
## alternative hypothesis: two-sided
testTemporalAutocorrelation(r_probit, time = 1:nrow(Challenger))

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 0.92502, p-value = 0.002246
## alternative hypothesis: true autocorrelation is greater than 0
testTemporalAutocorrelation(r_probit, time = mod_binom_probit$fitted.values)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.2108, p-value = 0.6963
## alternative hypothesis: true autocorrelation is greater than 0
testTemporalAutocorrelation(r_probit, time = Challenger$temp)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.2108, p-value = 0.6963
## alternative hypothesis: true autocorrelation is greater than 0
testZeroInflation(r_probit)  ## does not work (fix in progress)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 = fitted model
## 
## data:  r_probit
## ratioObsExp = 0, p-value = 0.001
## alternative hypothesis: more
table(Challenger$oring_fine == 0)
## 
## FALSE 
##    23

Let us try to check the zero-inflation differently:

test <- replicate(1000, sum(simulate(mod_binom_probit, 1)[[1]][, 2] < 1))
hist(test)
abline(v = sum(Challenger$oring_dt < 1), col = "red", lwd = 2)

Tests

Anova(mod_binom_probit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: cbind(oring_fine, oring_dt)
##      LR Chisq Df Pr(>Chisq)    
## temp   11.276  1  0.0007853 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_binom_probit)
## 
## Call:
## glm(formula = cbind(oring_fine, oring_dt) ~ temp, family = binomial(link = "probit"), 
##     data = Challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5377   0.1555   0.2984   0.5585   0.7894  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -4.21720    1.85200  -2.277  0.02278 * 
## temp         0.08873    0.02888   3.072  0.00212 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20.7057  on 22  degrees of freedom
## Residual deviance:  9.4301  on 21  degrees of freedom
## AIC: 24.768
## 
## Number of Fisher Scoring iterations: 6

Predictions

pred <- data.frame(temp = 31:81)
pred_probit <- predict(mod_binom_probit, newdata = pred, se.fit = TRUE)
pred$pred <- binomial(link = "probit")$linkinv(pred_probit$fit)
pred$lwr <- binomial(link = "probit")$linkinv(
  pred_probit$fit + qnorm(0.025) *
  sqrt(pred_probit$se.fit^2 + pred_probit$residual.scale^2))
pred$upr <- binomial(link = "probit")$linkinv(
  pred_probit$fit + qnorm(0.975) *
    sqrt(pred_probit$se.fit^2 + pred_probit$residual.scale^2))

plot(1 - pred$pred ~ pred$temp, type = "l",
     ylim = c(0, 1), ylab = "Probability of failure",
     xlab = "Temp (F)")
points(1 - pred$lwr ~ pred$temp, type = "l", lty = 2)
points(1 - pred$upr ~ pred$temp, type = "l", lty = 2)
points(oring_dt / oring_tot ~ temp, data = Challenger, col = "red")

## Predict logit
pred2 <- data.frame(temp = 31:81)
pred_logit <- predict(mod_binom_logit, newdata = pred, se.fit = TRUE)
pred2$pred <- binomial(link = "logit")$linkinv(pred_logit$fit)
pred2$lwr <- binomial(link = "logit")$linkinv(
  pred_logit$fit + qnorm(0.025) *
    sqrt(pred_logit$se.fit^2 + pred_logit$residual.scale^2))
pred2$upr <- binomial(link = "logit")$linkinv(
  pred_logit$fit + qnorm(0.975) *
    sqrt(pred_logit$se.fit^2 + pred_logit$residual.scale^2))

points(1 - pred2$pred ~ pred2$temp, type = "l", col = 3)
points(1 - pred2$lwr ~ pred2$temp, type = "l", lty = 2, col = 3)
points(1 - pred2$upr ~ pred2$temp, type = "l", lty = 2, col = 3)

More

Since a single failure could be fatal, perhaps a binary model would make more sense!

You could try to replicate the original study that excluded all the OK flights!
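As a sketch of the first suggestion (the names any_pb and mod_binary are mine), one could recode the response as "at least one O-ring showed a problem on that flight" and fit a binary GLM:

```r
## binary response: did at least one O-ring show a problem on that flight?
Challenger$any_pb <- Challenger$oring_dt > 0
mod_binary <- glm(any_pb ~ temp, data = Challenger,
                  family = binomial(link = "logit"))
summary(mod_binary)
```

With only 23 flights, expect wider confidence intervals than with the binomial model above.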


Dataset: LM2GLMM::UK

Goal

Smoking behaviour

Looking at the data

str(UK)
## 'data.frame':    14875 obs. of  19 variables:
##  $ sex               : Factor w/ 2 levels "Boy","Girl": 2 1 1 2 2 1 2 2 2 1 ...
##  $ age               : num  3660 3841 3675 3694 3742 ...
##  $ weight            : num  29.4 31.6 34.9 30.2 NA 24.6 40.8 29.9 25.6 29.1 ...
##  $ height            : num  134 141 143 140 NA ...
##  $ glasses           : Factor w/ 2 levels "No","Yes": 1 1 NA 1 NA 1 1 1 1 1 ...
##  $ cigarettes_mum    : num  0 20 20 0 0 15 20 0 30 0 ...
##  $ cigarettes_dad    : num  0 0 0 0 0 0 30 0 0 0 ...
##  $ cigarettes        : num  0 20 20 0 0 15 50 0 30 0 ...
##  $ cigarettes_kid    : Factor w/ 8 levels "Never","Tried once",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ cigarettes_friends: Factor w/ 3 levels "None of them",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ bronchitis        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ drink             : Factor w/ 4 levels "2 to 3 times a week",..: 4 3 4 4 NA 3 3 4 1 4 ...
##  $ milk              : num  0 1 3 0 0 3 2 3 0 1 ...
##  $ coca              : num  0 1 1 0 0 0 1 0 2 1 ...
##  $ backward          : num  20 16 4 12 NA 14 8 20 6 7 ...
##  $ mother_height     : num  157 157 155 160 NA 155 155 165 168 157 ...
##  $ father_height     : num  168 NA 175 NA NA 173 173 178 168 175 ...
##  $ mother_weight     : num  51 51 50 63 NA 50 57 90 61 53 ...
##  $ father_weight     : num  79 NA 72 NA NA 76 69 79 65 73 ...
lapply(UK, summary)
## $sex
##  Boy Girl 
## 7713 7162 
## 
## $age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3557    3666    3685    3708    3721    4237    3253 
## 
## $weight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.00   28.60   31.80   32.48   35.40   51.00    2433 
## 
## $height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   117.0   134.2   138.4   138.6   142.9   163.2    2147 
## 
## $glasses
##    No   Yes  NA's 
## 11997  1454  1424 
## 
## $cigarettes_mum
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.848  10.000  80.000 
## 
## $cigarettes_dad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   7.337  15.000  99.000 
## 
## $cigarettes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    4.00   13.18   20.00  120.00 
## 
## $cigarettes_kid
##               Never          Tried once         Tried twice       Lt 1 per week    About 1 per week        2-5 per week     About 1 per day More than 1 per day                NA's 
##               10808                1286                 324                  68                  45                  37                   8                  15                2284 
## 
## $cigarettes_friends
## None of them Some of them Most of them         NA's 
##        10552         1755          261         2307 
## 
## $bronchitis
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1498  0.0000  1.0000 
## 
## $drink
## 2 to 3 times a week           Most days          Not at all Once a week or less                NA's 
##                 539                 148                6689                5354                2145 
## 
## $milk
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   1.000   1.679   2.000  26.000    2580 
## 
## $coca
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.889   1.000  51.000    2580 
## 
## $backward
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   11.00   18.00   15.44   20.00   20.00    2318 
## 
## $mother_height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   124.0   157.0   160.0   161.3   165.0   188.0    1473 
## 
## $father_height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   137.0   170.0   175.0   175.3   180.0   211.0    2068 
## 
## $mother_weight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.00   54.00   60.00   61.01   66.00  164.00    1603 
## 
## $father_weight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   41.00   69.00   75.00   75.24   82.00  167.00    2334
UK$cigarettes_kid_bin <- UK$cigarettes_kid != "Never" 
table(UK$cigarettes_kid_bin, UK$cigarettes_friends)
##        
##         None of them Some of them Most of them
##   FALSE         9484         1128          142
##   TRUE          1045          612          113
scatterplot(cigarettes_kid_bin ~ age + sex, data = UK)
## Warning in smoother(.x[subs], .y[subs], col = col[i], log.x = logged("x"), : could not fit smooth

scatterplot(cigarettes_kid_bin ~ cigarettes + sex, data = UK)
## Warning in smoother(.x[subs], .y[subs], col = col[i], log.x = logged("x"), : could not fit smooth

scatter.smooth(UK$age, UK$cigarettes, lpars = list(col = "red", lwd = 2))

Fitting the models

mod_logit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                 family = binomial(link = "logit"), data = UK)
mod_probit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                  family = binomial(link = "probit"), data = UK)
mod_cauchit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                   family = binomial(link = "cauchit"), data = UK)
c(AIC(mod_logit), AIC(mod_probit), AIC(mod_cauchit))
## [1] 8512.079 8513.535 8513.982

Comparison to null model

anova(mod_logit, update(mod_logit, . ~ 1, data = mod_logit$model), test = "LR")
## Analysis of Deviance Table
## 
## Model 1: cigarettes_kid_bin ~ sex * (cigarettes + age + cigarettes_friends)
## Model 2: cigarettes_kid_bin ~ 1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     11471     8492.1                          
## 2     11480     9362.7 -9  -870.58 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test of the model assumptions

resid_mod_logit <- simulateResiduals(mod_logit, refit = TRUE, n = 1000) ## slow
plot(resid_mod_logit)

testUniformity(resid_mod_logit)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.012736, p-value = 0.04824
## alternative hypothesis: two-sided
testTemporalAutocorrelation(resid_mod_logit, time = mod_logit$fitted.values)

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.0072, p-value = 0.6506
## alternative hypothesis: true autocorrelation is greater than 0
testTemporalAutocorrelation(resid_mod_logit, time = 1:nrow(mod_logit$model))

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 1.995, p-value = 0.3935
## alternative hypothesis: true autocorrelation is greater than 0

Tests

Anova(mod_logit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: cigarettes_kid_bin
##                        LR Chisq Df Pr(>Chisq)    
## sex                      106.38  1  < 2.2e-16 ***
## cigarettes                26.01  1  3.394e-07 ***
## age                       11.78  1  0.0005995 ***
## cigarettes_friends       623.94  2  < 2.2e-16 ***
## sex:cigarettes             0.09  1  0.7663928    
## sex:age                    0.12  1  0.7329848    
## sex:cigarettes_friends     0.76  2  0.6837439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictions

Using the nice cond argument from visreg:

visreg(mod_logit, "cigarettes", by = "sex", scale = "response",
       cond = list(cigarettes_friends = "None of them", age = mean(mod_logit$model$age)))

visreg(mod_logit, "cigarettes", by = "sex", scale = "response",
       cond = list(cigarettes_friends = "Most of them", age = mean(mod_logit$model$age)))

visreg(mod_logit, "age", by = "sex", scale = "response",
       cond = list(cigarettes = 0, cigarettes_friends = "None of them"))

visreg(mod_logit, "cigarettes_friends", by = "sex", scale = "response",
       cond = list(cigarettes = 0, age = mean(mod_logit$model$age)))


Dataset: HSE98women, exercise menopause

Goal

Note: here, for the sake of simplicity, we simply define menopause as the absence of periods at older ages.

Data preparation and exploration

We create the response variable:

HSE98women$menopause <-  HSE98women$period == FALSE

We fit a first model so we can easily explore the data actually used for the fit, rather than the full dataset (which contains many missing values).

mod_menop_logit <- glm(menopause ~ age + bmi + smoked, data = HSE98women, family = binomial())
nrow(HSE98women) ## original number of observations
## [1] 10556
nrow(mod_menop_logit$model)  ## number of observations considered in the model
## [1] 7218

We compare the variables before and after filtering to see whether any obvious bias is introduced by the observations we discarded:

summary(HSE98women[, c("menopause", "age", "bmi", "smoked")])
##  menopause            age             bmi          smoked       
##  Mode :logical   Min.   : 2.00   Min.   :12.39   Mode :logical  
##  FALSE:4303      1st Qu.:23.00   1st Qu.:21.11   FALSE:3959     
##  TRUE :3741      Median :39.00   Median :24.38   TRUE :4735     
##  NA's :2512      Mean   :40.52   Mean   :24.96   NA's :1862     
##                  3rd Qu.:58.00   3rd Qu.:28.30                  
##                  Max.   :97.00   Max.   :54.95                  
##                                  NA's   :1172
summary(mod_menop_logit$model)
##  menopause            age             bmi          smoked       
##  Mode :logical   Min.   :18.00   Min.   :13.80   Mode :logical  
##  FALSE:3967      1st Qu.:33.00   1st Qu.:22.79   FALSE:3214     
##  TRUE :3251      Median :46.00   Median :25.55   TRUE :4004     
##                  Mean   :47.38   Mean   :26.47                  
##                  3rd Qu.:61.00   3rd Qu.:29.30                  
##                  Max.   :94.00   Max.   :54.95

The only big discrepancy between the two datasets is that the minimum age is now 18, while the original dataset also contains children. This means girls too young to have periods are not considered, which will greatly help us model menopause.

Let’s check the correlation between our two continuous predictors:

cor.test(HSE98women$age, HSE98women$bmi)
## 
##  Pearson's product-moment correlation
## 
## data:  HSE98women$age and HSE98women$bmi
## t = 50.874, df = 9382, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4489847 0.4807042
## sample estimates:
##       cor 
## 0.4649937
cor.test(mod_menop_logit$model$age, mod_menop_logit$model$bmi)
## 
##  Pearson's product-moment correlation
## 
## data:  mod_menop_logit$model$age and mod_menop_logit$model$bmi
## t = 13.85, df = 7216, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1383667 0.1833131
## sample estimates:
##       cor 
## 0.1609233

The correlation is not very high, especially in the dataset that is used for the fit, so multicollinearity should not be an issue.
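On top of the pairwise correlation, you could also look at variance inflation factors, which summarise how much each coefficient's variance is inflated by collinearity (a sketch; car::vif() works on GLM fits):

```r
library(car)
vif(mod_menop_logit)  ## values close to 1 indicate little multicollinearity
```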

We visualise the relationship between the predictor and the response:

library(car)
mod_menop_logit$model$smoked <- as.factor(mod_menop_logit$model$smoked) ## patch to solve bug in car 3.0
scatterplot(menopause ~ age + smoked, data = mod_menop_logit$model)

scatterplot(menopause ~ bmi + smoked, data = mod_menop_logit$model)

There does not seem to be any obvious interaction between age and smoked, or between BMI and smoked, so we won’t consider any. We will, however, consider that the effect of age may depend on BMI.

Fitting the models

mod_menop_logit   <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial())
mod_menop_probit  <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial(link = "probit"))
mod_menop_cauchit <- glm(menopause ~ age * bmi + smoked, data = HSE98women, family = binomial(link = "cauchit"))

Because it is very difficult to guess the shape of the relationship between age, BMI and the probability of being menopausal, we will also fit generalised additive models (GAM), which allow the effects to vary smoothly along the predictors.

library(mgcv)
mod_menop_logit_gam   <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial())
mod_menop_probit_gam  <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial(link = "probit"))
mod_menop_cauchit_gam <- gam(menopause ~ s(age, bmi) + smoked, data = HSE98women, family = binomial(link = "cauchit"))
cbind(logit = AIC(mod_menop_logit),
      probit = AIC(mod_menop_probit),
      cauchit = AIC(mod_menop_cauchit),
      logit_GAM = AIC(mod_menop_logit_gam),
      probit_GAM = AIC(mod_menop_probit_gam),
      cauchit_GAM = AIC(mod_menop_cauchit_gam)
      )
##         logit   probit  cauchit logit_GAM probit_GAM cauchit_GAM
## [1,] 3985.441 4097.282 3839.312   3720.37   3722.458     3733.11

The GAM with the logit link seems to be the best model, although it prevents us from expressing the effects of the coefficients as simply as in a usual GLM. That is too bad, but as the model is much better, we will go for it.

Tests

We are going to test the effect of the interaction by a likelihood ratio test:

mod_menop_logit_gam_no_int <- gam(menopause ~ s(age) + s(bmi) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_int)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_int$df.null - mod_menop_logit_gam_no_int$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0.175853

Note: you can also use lmtest::lrtest() but it rounds the DF :-/

The interaction between age and BMI is not significant (LRT, LR = 4.5, df = 2.69, p-value = 0.18).
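Since we will repeat this likelihood ratio test for several terms, the three steps above can be wrapped in a small helper (a sketch; the function name lrt_gam is mine):

```r
lrt_gam <- function(mod_reduced, mod_full) {
  ## likelihood ratio statistic
  LR <- -2 * (logLik(mod_reduced)[[1]] - logLik(mod_full)[[1]])
  ## difference in (possibly non-integer) degrees of freedom used by each model
  DF <- (mod_full$df.null - mod_full$df.residual) -
        (mod_reduced$df.null - mod_reduced$df.residual)
  c(LR = LR, DF = DF, p_value = pchisq(LR, DF, lower.tail = FALSE))
}
lrt_gam(mod_menop_logit_gam_no_int, mod_menop_logit_gam)
```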

Let’s now test the effect of BMI by a likelihood ratio test:

mod_menop_logit_gam_no_bmi <- gam(menopause ~ s(age) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_bmi)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_bmi$df.null - mod_menop_logit_gam_no_bmi$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 5.514563e-157

The effect of BMI is strongly significant (LRT, LR = 732.7, df = 4.23, p-value < 0.001).

Note that GAM models have non-integer degrees of freedom. This is normal.

Let’s now test the effect of age by a likelihood ratio test:

mod_menop_logit_gam_no_age <- gam(menopause ~ s(bmi) + smoked, data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_age)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_age$df.null - mod_menop_logit_gam_no_age$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0

The effect of age is strongly significant (LRT, LR = 5900.1, df = 5.42, p-value < 0.001).

Let’s now test the effect of smoking with a likelihood ratio test:

mod_menop_logit_gam_no_smoked <- gam(menopause ~ s(age, bmi), data = HSE98women, family = binomial())
LR <- -2 * (logLik(mod_menop_logit_gam_no_smoked)[[1]] - logLik(mod_menop_logit_gam)[[1]])
DF <- (mod_menop_logit_gam$df.null - mod_menop_logit_gam$df.residual) - (mod_menop_logit_gam_no_smoked$df.null - mod_menop_logit_gam_no_smoked$df.residual)
(PV <- pchisq(LR, DF, lower.tail = FALSE))
## [1] 0.0001646144

The effect of having smoked is also significant (LRT, LR = 14.5, df = 1.09, p-value < 0.001).

Predictions

Let’s first explore the model output with the package visreg, which also works on GAM models:

library(visreg)
visreg(mod_menop_logit_gam)

These plots show the patterns captured by the smooth terms of the GAM: the effect of age seems to change after 40 yrs and the effect of BMI seems to change between 23 and 32 kg/m². It is difficult to capture such threshold effects with polynomials, so we will have to stick to the GAM here…
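To see why smooth terms handle such threshold-like patterns better than polynomials, here is a self-contained sketch on simulated data (the data and the cut-point at x = 6 are made up for illustration):

```r
library(splines)  # ships with base R

set.seed(123)
x <- runif(300, 0, 10)
# Flat response before x = 6, then rising: a threshold-like pattern
y <- ifelse(x < 6, 0, x - 6) + rnorm(300, sd = 0.3)

mod_poly   <- lm(y ~ poly(x, 3))     # cubic polynomial
mod_spline <- lm(y ~ ns(x, df = 6))  # natural cubic spline
AIC(mod_poly, mod_spline)            # the spline should fit markedly better
```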

Let us now do the same plots on the scale of the response variable:

visreg(mod_menop_logit_gam, scale = "response")

visreg2d(mod_menop_logit_gam, "age", "bmi", scale = "response")

#visreg2d(mod_menop_logit_gam, "age", "bmi", scale = "response", plot.type = "rgl")

We will now plot the predictions from the GLM with the logit link, the GLM with the cauchit link, and the GAM with the logit link to explore their differences. We will predict the influence of age, fixing BMI at its median and considering smokers.

age.for.pred <- seq(min(mod_menop_cauchit_gam$model$age), max(mod_menop_cauchit_gam$model$age), length = 100)
data.for.pred <- expand.grid(
  age = age.for.pred,
  bmi = median(mod_menop_cauchit_gam$model$bmi),
  smoked = TRUE
  )

gam.p <- predict(mod_menop_logit_gam, newdata = data.for.pred, se.fit = TRUE)
gam.upr <- binomial(link = "logit")$linkinv(gam.p$fit + qnorm(0.975) * gam.p$se.fit)
gam.lwr <- binomial(link = "logit")$linkinv(gam.p$fit + qnorm(0.025) * gam.p$se.fit)
gam.fit <- binomial(link = "logit")$linkinv(gam.p$fit)

cauchit.p <- predict(mod_menop_cauchit, newdata = data.for.pred, se.fit = TRUE)
cauchit.upr <- binomial(link = "cauchit")$linkinv(cauchit.p$fit + qnorm(0.975) * cauchit.p$se.fit)
cauchit.lwr <- binomial(link = "cauchit")$linkinv(cauchit.p$fit + qnorm(0.025) * cauchit.p$se.fit)
cauchit.fit <- binomial(link = "cauchit")$linkinv(cauchit.p$fit)

logit.p <- predict(mod_menop_logit, newdata = data.for.pred, se.fit = TRUE)
logit.upr <- binomial(link = "logit")$linkinv(logit.p$fit + qnorm(0.975) * logit.p$se.fit)
logit.lwr <- binomial(link = "logit")$linkinv(logit.p$fit + qnorm(0.025) * logit.p$se.fit)
logit.fit <- binomial(link = "logit")$linkinv(logit.p$fit)

plot(gam.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1))
points(gam.upr ~ age.for.pred, type = "l", lty = 2)
points(gam.lwr ~ age.for.pred, type = "l", lty = 2)

points(cauchit.fit ~ age.for.pred, type = "l", lty = 1, col = "green")
points(cauchit.upr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(cauchit.lwr ~ age.for.pred, type = "l", lty = 2, col = "green")

points(logit.fit ~ age.for.pred, type = "l", lty = 1, col = "red")
points(logit.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(logit.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")

legend("topleft", fill = c("black", "green", "red"), legend = c("logit GAM", "cauchit", "logit"), bty = "n")
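The fit/upper/lower computations above repeat the same pattern for each model; they can be factored into a small helper (a sketch; inv_ci() is a hypothetical name, demonstrated on a logit GLM fitted to the built-in mtcars data):

```r
# Back-transform link-scale predictions and a Wald confidence
# interval through the model's own inverse link function
inv_ci <- function(model, newdata, level = 0.95) {
  p <- predict(model, newdata = newdata, se.fit = TRUE)
  z <- qnorm(1 - (1 - level) / 2)
  linkinv <- family(model)$linkinv
  data.frame(fit = linkinv(p$fit),
             lwr = linkinv(p$fit - z * p$se.fit),
             upr = linkinv(p$fit + z * p$se.fit))
}

# Illustration on a logit GLM fitted to the built-in mtcars data
mod_demo <- glm(am ~ wt, data = mtcars, family = binomial())
inv_ci(mod_demo, newdata = data.frame(wt = c(2, 3, 4)))
```

The same helper would work unchanged on the GAM fits, since they also expose family() and predict(…, se.fit = TRUE).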

Keeping the best model (the logit GAM), let’s now look at the effect of BMI, still considering smokers:

data.for.pred <- expand.grid(
  age = age.for.pred,
  bmi = c(18.5, 25, 30),
  smoked = TRUE
  )

gam.skiny.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 18.5), se.fit = TRUE)
gam.skiny.upr <- binomial(link = "logit")$linkinv(gam.skiny.p$fit + qnorm(0.975) * gam.skiny.p$se.fit)
gam.skiny.lwr <- binomial(link = "logit")$linkinv(gam.skiny.p$fit + qnorm(0.025) * gam.skiny.p$se.fit)
gam.skiny.fit <- binomial(link = "logit")$linkinv(gam.skiny.p$fit)

gam.overweight.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 25), se.fit = TRUE)
gam.overweight.upr <- binomial(link = "logit")$linkinv(gam.overweight.p$fit + qnorm(0.975) * gam.overweight.p$se.fit)
gam.overweight.lwr <- binomial(link = "logit")$linkinv(gam.overweight.p$fit + qnorm(0.025) * gam.overweight.p$se.fit)
gam.overweight.fit <- binomial(link = "logit")$linkinv(gam.overweight.p$fit)

gam.obese.p <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 30), se.fit = TRUE)
gam.obese.upr <- binomial(link = "logit")$linkinv(gam.obese.p$fit + qnorm(0.975) * gam.obese.p$se.fit)
gam.obese.lwr <- binomial(link = "logit")$linkinv(gam.obese.p$fit + qnorm(0.025) * gam.obese.p$se.fit)
gam.obese.fit <- binomial(link = "logit")$linkinv(gam.obese.p$fit)

plot(gam.skiny.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1))
points(gam.skiny.upr ~ age.for.pred, type = "l", lty = 2)
points(gam.skiny.lwr ~ age.for.pred, type = "l", lty = 2)

points(gam.overweight.fit ~ age.for.pred, type = "l", lty = 1, col = "green")
points(gam.overweight.upr ~ age.for.pred, type = "l", lty = 2, col = "green")
points(gam.overweight.lwr ~ age.for.pred, type = "l", lty = 2, col = "green")

points(gam.obese.fit ~ age.for.pred, type = "l", lty = 1, col = "red")
points(gam.obese.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(gam.obese.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")

legend("topleft", fill = c("black", "green", "red"), legend = c("skinny", "overweight", "obese"), bty = "n")

This plot suggests that obesity accelerates the onset of menopause, but it does not seem to make any difference later on.

We will now predict the proportion of women in menopause at 40, 45, 50, 55 and 60 yrs. To illustrate the effect of BMI, we will perform predictions for a BMI of 21.75 (middle of the WHO normal range) and 30 (lower limit for obesity) kg/m². Again, we will consider smokers. We will also compute the confidence intervals (prediction intervals make little sense when predicting proportions…).

data.for.pred <- expand.grid(
  age = c(40, 45, 50, 55, 60),
  bmi = c(21.75, 30),
  smoked = TRUE
  )
pred_normal <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
tab_normal <- cbind(
  predict = plogis(pred_normal$fit),
  lwr.CI = plogis(pred_normal$fit + qnorm(0.025)*pred_normal$se.fit),
  upr.CI = plogis(pred_normal$fit + qnorm(0.975)*pred_normal$se.fit)
)
rownames(tab_normal) <- c(40, 45, 50, 55, 60)
round(tab_normal, 2)
##    predict lwr.CI upr.CI
## 40    0.08   0.06   0.10
## 45    0.18   0.15   0.22
## 50    0.51   0.45   0.56
## 55    0.87   0.83   0.90
## 60    0.97   0.95   0.98

This gives the predictions for women with a normal BMI. Let’s now look at the corresponding proportions for obese women:

pred_obese <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 30), se.fit = TRUE)
tab_obese <- cbind(
  predict = plogis(pred_obese$fit),
  lwr.CI = plogis(pred_obese$fit + qnorm(0.025)*pred_obese$se.fit),
  upr.CI = plogis(pred_obese$fit + qnorm(0.975)*pred_obese$se.fit)
)
rownames(tab_obese) <- c(40, 45, 50, 55, 60)
round(tab_obese, 2)
##    predict lwr.CI upr.CI
## 40    0.14   0.11   0.17
## 45    0.26   0.22   0.31
## 50    0.57   0.51   0.62
## 55    0.87   0.84   0.90
## 60    0.97   0.95   0.98

These tables confirm our expectation concerning the influence of BMI on the onset of menopause.

For the sake of comparison, let’s compare the first table to what we would have predicted using the cauchit GLM or the logit GLM:

data.for.pred <- expand.grid(age = c(40, 45, 50, 55, 60), bmi = 21.75, smoked = TRUE)
pred_GAM <- predict(mod_menop_logit_gam, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
pred_logit <- predict(mod_menop_logit, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
pred_cauchit <- predict(mod_menop_cauchit, newdata = subset(data.for.pred, bmi == 21.75), se.fit = TRUE)
tab <- cbind(
  logit_GAM = plogis(pred_GAM$fit),
  logit_GLM = plogis(pred_logit$fit),
  cauchit_GLM = binomial(link = "cauchit")$linkinv(pred_cauchit$fit)
)
rownames(tab) <- c(40, 45, 50, 55, 60)
round(tab, 2)
##    logit_GAM logit_GLM cauchit_GLM
## 40      0.08      0.14        0.07
## 45      0.18      0.32        0.15
## 50      0.51      0.56        0.57
## 55      0.87      0.78        0.88
## 60      0.97      0.91        0.93

Let’s look at the same proportions in the raw data. We cannot select individuals too strictly, otherwise we won’t get many of them, so let’s enlarge the selection a bit and check that we get at least a couple hundred individuals per age class:

tab <- with(HSE98women,
     rbind(sum(age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
           sum(age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
           sum(age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
           sum(age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
           sum(age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25, na.rm = TRUE)
           )
)
rownames(tab) <- c(40, 45, 50, 55, 60)
tab
##    [,1]
## 40  199
## 45  160
## 50  172
## 55  132
## 60   97

That seems like a good selection process (+/- 1 year and the whole WHO normal BMI range), so let’s check the proportion of menopausal women in each of these categories (keeping both smokers and non-smokers, as the smoking effect is rather small; see below):

tab <- with(HSE98women, 
     rbind(
       mean(menopause[age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25], na.rm = TRUE)
     )
)
rownames(tab) <- c(40, 45, 50, 55, 60)
round(tab, 2)
##    [,1]
## 40 0.08
## 45 0.12
## 50 0.52
## 55 0.86
## 60 0.98
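The repeated sum() and mean() calls above can be replaced by a single pass with aggregate(); here is a sketch on simulated data (the variables mimic the real survey data, which are not shipped with this vignette, and all numbers are made up):

```r
set.seed(42)
# Simulated stand-in for the survey data
dat <- data.frame(age = round(runif(2000, 35, 65)),
                  bmi = rnorm(2000, mean = 25, sd = 4))
dat$menopause <- rbinom(2000, 1, plogis((dat$age - 50) / 3))

# Keep normal-BMI women within +/- 1 yr of 40, 45, ..., 60,
# then compute counts and proportions in a single call
sel <- subset(dat, bmi > 18.5 & bmi < 25 &
                   age >= 39 & age <= 61 &
                   abs(age - round(age / 5) * 5) <= 1)
sel$age_class <- round(sel$age / 5) * 5
tab_sim <- aggregate(menopause ~ age_class, data = sel,
                     FUN = function(x) c(n = length(x), prop = round(mean(x), 2)))
tab_sim
```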

These results are roughly consistent with the predictions from the GAM. They confirm that the logit and cauchit GLMs underestimate the prevalence of menopause at old ages. All models, however, may overestimate the prevalence of menopause at young ages. Depending on the goal of the study, other link functions, not implemented in base R, could be tried, but this is quite advanced so we won’t do it here.

We can now turn to the exploration of the smoking effect by comparing predictions between smokers and non-smokers at the median BMI:

data.for.pred.nosmoke <- expand.grid(
  age = age.for.pred,
  bmi = median(mod_menop_cauchit_gam$model$bmi),
  smoked = FALSE
  )

gam.p.nosmoke <- predict(mod_menop_logit_gam, newdata = data.for.pred.nosmoke, se.fit = TRUE)
gam.upr.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit + qnorm(0.975) * gam.p.nosmoke$se.fit)
gam.lwr.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit + qnorm(0.025) * gam.p.nosmoke$se.fit)
gam.fit.nosmoke <- binomial(link = "logit")$linkinv(gam.p.nosmoke$fit)


plot(gam.fit ~ age.for.pred, type = "l", las = 1, ylab = "Probability of being menopaused", xlab = "Age (yrs)", ylim = c(0, 1), col = "red")
points(gam.upr ~ age.for.pred, type = "l", lty = 2, col = "red")
points(gam.lwr ~ age.for.pred, type = "l", lty = 2, col = "red")

points(gam.fit.nosmoke ~ age.for.pred, type = "l", col = "green")
points(gam.upr.nosmoke ~ age.for.pred, type = "l", lty = 2, col = "green")
points(gam.lwr.nosmoke ~ age.for.pred, type = "l", lty = 2, col = "green")

legend("topleft", fill = c("green", "red"), legend = c("non-smoker", "smoker"), bty = "n")

Although significant, the effect of smoking is rather small. Because the interaction between smoking status and BMI was not significant, we won’t draw predictions to explore it.


Dataset: MASS::mammals

Goal

Explore the data

data(mammals, package = "MASS")
head(mammals)
##                    body brain
## Arctic fox        3.385  44.5
## Owl monkey        0.480  15.5
## Mountain beaver   1.350   8.1
## Cow             465.000 423.0
## Grey wolf        36.330 119.5
## Goat             27.660 115.0
str(mammals)
## 'data.frame':    62 obs. of  2 variables:
##  $ body : num  3.38 0.48 1.35 465 36.33 ...
##  $ brain: num  44.5 15.5 8.1 423 119.5 ...
plot(brain ~ body, data = mammals)

plot(brain ~ body, data = mammals, log = "xy")

plot(log(brain, 10) ~ log(body, 10), data = mammals, pch = ".")
text(log(mammals$body, 10), log(mammals$brain, 10),
     labels = rownames(mammals), cex = 0.8)

Reshaping the data

mammals$log10_brain <- log(mammals$brain, base = 10)
mammals$log10_body <-  log(mammals$body, base = 10)

Fitting the model

mod <- lm(log10_brain ~ log10_body, data = mammals)
plot(mod)

influence.measures(mod)
## Influence measures of
##   lm(formula = log10_brain ~ log10_body, data = mammals) :
## 
##                             dfb.1_  dfb.l10_    dffit cov.r   cook.d    hat inf
## Arctic fox                 0.12929 -0.005286  0.13865 1.011 9.58e-03 0.0162    
## Owl monkey                 0.26067 -0.147324  0.26503 0.961 3.40e-02 0.0233    
## Mountain beaver           -0.05218  0.016630 -0.05237 1.048 1.39e-03 0.0179    
## Cow                       -0.04138 -0.211577 -0.25174 1.055 3.16e-02 0.0549    
## Grey wolf                 -0.00609 -0.007042 -0.01197 1.060 7.28e-05 0.0247    
## Goat                       0.01413  0.013610  0.02525 1.057 3.24e-04 0.0227    
## Roe deer                   0.05871  0.034610  0.08615 1.041 3.75e-03 0.0192    
## Guinea pig                -0.09246  0.035742 -0.09247 1.039 4.32e-03 0.0190    
## Verbet                     0.14358  0.004869  0.15857 0.999 1.25e-02 0.0161    
## Chinchilla                 0.08146 -0.048110  0.08326 1.050 3.51e-03 0.0242    
## Ground squirrel            0.25867 -0.219237  0.28821 1.003 4.08e-02 0.0383    
## Arctic ground squirrel    -0.06762  0.028197 -0.06763 1.047 2.32e-03 0.0195    
## African giant pouched rat -0.04995  0.019801 -0.04995 1.050 1.27e-03 0.0191    
## Lesser short-tailed shrew -0.04166  0.050503 -0.05574 1.135 1.58e-03 0.0901   *
## Star-nosed mole           -0.00550  0.005087 -0.00635 1.083 2.05e-05 0.0451    
## Nine-banded armadillo     -0.12039  0.003547 -0.12965 1.016 8.40e-03 0.0161    
## Tree hyrax                -0.02700  0.005614 -0.02756 1.050 3.86e-04 0.0168    
## N.A. opossum              -0.13177  0.033606 -0.13332 1.017 8.89e-03 0.0172    
## Asian elephant             0.00691  0.169305  0.18752 1.119 1.78e-02 0.0873   *
## Big brown bat             -0.15335  0.160920 -0.18818 1.080 1.78e-02 0.0600    
## Donkey                    -0.00234 -0.007012 -0.00896 1.079 4.08e-05 0.0416    
## Horse                     -0.01962 -0.107873 -0.12749 1.086 8.23e-03 0.0568    
## European hedgehog         -0.14647  0.066672 -0.14677 1.019 1.08e-02 0.0203    
## Patas monkey               0.13081  0.051285  0.17243 0.996 1.47e-02 0.0177    
## Cat                        0.03635 -0.001799  0.03885 1.048 7.66e-04 0.0162    
## Galago                     0.16730 -0.122879  0.17828 1.032 1.59e-02 0.0307    
## Genet                      0.09091 -0.027906  0.09134 1.037 4.21e-03 0.0178    
## Giraffe                   -0.01805 -0.100220 -0.11834 1.088 7.09e-03 0.0570    
## Gorilla                   -0.01057 -0.033490 -0.04238 1.079 9.13e-04 0.0430    
## Grey seal                  0.03027  0.058263  0.08230 1.061 3.43e-03 0.0323    
## Rock hyrax-a               0.12403 -0.057785  0.12438 1.030 7.77e-03 0.0206    
## Human                      0.21965  0.352501  0.52677 0.797 1.22e-01 0.0292   *
## African elephant           0.00077 -0.049868 -0.05399 1.161 1.48e-03 0.1098   *
## Water opossum             -0.31028  0.009141 -0.33415 0.845 5.09e-02 0.0161   *
## Rhesus monkey              0.26321  0.058332  0.31729 0.868 4.65e-02 0.0167   *
## Kangaroo                  -0.09323 -0.105229 -0.18077 1.015 1.63e-02 0.0244    
## Yellow-bellied marmot     -0.05936 -0.001288 -0.06523 1.042 2.15e-03 0.0161    
## Golden hamster            -0.13927  0.114256 -0.15340 1.051 1.18e-02 0.0362    
## Mouse                     -0.06543  0.068659 -0.08029 1.096 3.27e-03 0.0600    
## Little brown bat          -0.01958  0.022374 -0.02523 1.118 3.24e-04 0.0755   *
## Slow loris                 0.02667 -0.008240  0.02680 1.052 3.65e-04 0.0178    
## Okapi                     -0.00660 -0.023306 -0.02900 1.083 4.27e-04 0.0456    
## Rabbit                    -0.05933  0.008300 -0.06160 1.043 1.92e-03 0.0164    
## Sheep                      0.00117  0.001766  0.00270 1.064 3.70e-06 0.0282    
## Jaguar                    -0.05098 -0.107563 -0.14822 1.048 1.11e-02 0.0341    
## Chimpanzee                 0.10845  0.157062  0.24339 0.992 2.91e-02 0.0276    
## Baboon                     0.19253  0.080361  0.25728 0.934 3.17e-02 0.0179    
## Desert hedgehog           -0.17726  0.094998 -0.17929 1.009 1.60e-02 0.0224    
## Giant armadillo           -0.08742 -0.137621 -0.20701 1.015 2.13e-02 0.0289    
## Rock hyrax-b              -0.00906  0.000179 -0.00980 1.051 4.88e-05 0.0161    
## Raccoon                    0.07343  0.003105  0.08138 1.037 3.34e-03 0.0162    
## Rat                       -0.12585  0.084698 -0.13144 1.042 8.69e-03 0.0276    
## E. American mole          -0.00145  0.001296 -0.00165 1.080 1.39e-06 0.0421    
## Mole rat                   0.14010 -0.114569  0.15415 1.050 1.20e-02 0.0360    
## Musk shrew                -0.27530  0.263044 -0.32233 1.015 5.11e-02 0.0483    
## Pig                       -0.07152 -0.217294 -0.27696 1.018 3.79e-02 0.0420    
## Echidna                    0.04523 -0.003677  0.04782 1.046 1.16e-03 0.0162    
## Brazilian tapir           -0.06894 -0.189122 -0.24564 1.025 2.99e-02 0.0396    
## Tenrec                    -0.22939  0.096893 -0.22948 0.967 2.56e-02 0.0196    
## Phalanger                 -0.01210  0.003252 -0.01222 1.052 7.59e-05 0.0174    
## Tree shrew                 0.12605 -0.106264  0.14018 1.057 9.91e-03 0.0379    
## Red fox                    0.11779  0.004449  0.13030 1.015 8.48e-03 0.0161
plot(mod, which = 4)

summary(mod)
## 
## Call:
## lm(formula = log10_brain ~ log10_body, data = mammals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74503 -0.21380 -0.02676  0.18934  0.84613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92713    0.04171   22.23   <2e-16 ***
## log10_body   0.75169    0.02846   26.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3015 on 60 degrees of freedom
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195 
## F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16
mod_nohuman <- lm(log10_brain ~ log10_body, data = mammals[rownames(mammals)!="Human", ])
plot(mod_nohuman, which = 4)

Comparing to 2/3

t <- (mod$coefficients[2] - 2/3) / sqrt(diag(vcov(mod))[2])
2*pt(abs(t), df = mod$df.residual, lower.tail = FALSE)
## log10_body 
## 0.00407627
confint(mod)
##                 2.5 %    97.5 %
## (Intercept) 0.8436923 1.0105616
## log10_body  0.6947503 0.8086215
mod2 <- lm(log10_brain ~ 1 + offset(2/3*log10_body), data = mammals)
anova(mod, mod2)
## Analysis of Variance Table
## 
## Model 1: log10_brain ~ log10_body
## Model 2: log10_brain ~ 1 + offset(2/3 * log10_body)
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     60 5.4552                                
## 2     61 6.2663 -1  -0.81117 8.9219 0.004076 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t <- (mod_nohuman$coefficients[2] - 2/3) / sqrt(diag(vcov(mod_nohuman))[2])
2*pt(abs(t), df = mod_nohuman$df.residual, lower.tail = FALSE)
##  log10_body 
## 0.006644006
confint(mod_nohuman)
##                 2.5 %    97.5 %
## (Intercept) 0.8400581 0.9970116
## log10_body  0.6885052 0.7960480
mod2_nohuman <- lm(log10_brain ~ 1 + offset(2/3*log10_body), data = mammals[rownames(mammals)!="Human", ])
anova(mod_nohuman, mod2_nohuman)
## Analysis of Variance Table
## 
## Model 1: log10_brain ~ log10_body
## Model 2: log10_brain ~ 1 + offset(2/3 * log10_body)
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     59 4.7177                                
## 2     60 5.3507 -1  -0.63303 7.9167 0.006644 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
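The offset trick used above (fixing the slope at a theoretical value and comparing the two fits with anova()) generalizes to any hypothesized coefficient; here is a self-contained sketch on simulated data where the true slope really is 2/3 (all numbers are made up):

```r
set.seed(1)
x <- runif(100, 0, 4)
y <- 1 + (2/3) * x + rnorm(100, sd = 0.3)   # true slope is 2/3

mod_free  <- lm(y ~ x)                      # slope estimated from the data
mod_fixed <- lm(y ~ 1 + offset((2/3) * x))  # slope constrained to 2/3
anova(mod_free, mod_fixed)                  # F-test of H0: slope = 2/3
```

Here the test should (usually) not reject H0, since the data were generated with a slope of exactly 2/3.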

We can plot the predictions from the models:

plot(log(brain, 10) ~ log(body, 10), data = mammals, pch = ".")
text(log(mammals$body, 10), log(mammals$brain, 10),
     labels = rownames(mammals), cex = 0.8)
abline(mod, col = "blue", lwd = 2)
abline(mod_nohuman, col = "blue", lwd = 2, lty = 2)
abline(coef(mod2), 2/3, col = "red", lwd = 2)
abline(coef(mod2_nohuman), 2/3, col = "red", lwd = 2, lty = 2)
legend("topleft", lty = c(1, 1, 2, 2), lwd = c(2, 2, 2, 2), col = c("blue", "red", "blue", "red"),
       legend = c("slope estimated", "slope = 2/3", "slope estimated (no human)", "slope = 2/3 (no human)"))

Prediction of the brain size for an animal of 1 kg

p <- predict(mod_nohuman, newdata = data.frame(log10_body = log(1, base = 10)))
10^p
##        1 
## 8.289624

An animal of 1 kg is predicted to have a brain of about 8.29 g (using the model fitted without humans).
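A confidence interval for this prediction can be obtained on the log10 scale and back-transformed afterwards (a sketch refitting the no-human model from above, so that the chunk is self-contained):

```r
data(mammals, package = "MASS")
mammals$log10_brain <- log10(mammals$brain)
mammals$log10_body  <- log10(mammals$body)

mod_nohuman <- lm(log10_brain ~ log10_body,
                  data = mammals[rownames(mammals) != "Human", ])

# 95% CI on the log10 scale, then back-transformed to grams
p <- predict(mod_nohuman, newdata = data.frame(log10_body = log10(1)),
             interval = "confidence")
10^p
```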

Let us rank the organisms by brain size while controlling for body size (i.e. by the residuals of the model):

matrix(names(sort(resid(mod), decreasing = TRUE)), ncol = 1)
##       [,1]                       
##  [1,] "Human"                    
##  [2,] "Rhesus monkey"            
##  [3,] "Baboon"                   
##  [4,] "Owl monkey"               
##  [5,] "Chimpanzee"               
##  [6,] "Ground squirrel"          
##  [7,] "Patas monkey"             
##  [8,] "Verbet"                   
##  [9,] "Arctic fox"               
## [10,] "Red fox"                  
## [11,] "Galago"                   
## [12,] "Rock hyrax-a"             
## [13,] "Mole rat"                 
## [14,] "Tree shrew"               
## [15,] "Genet"                    
## [16,] "Raccoon"                  
## [17,] "Roe deer"                 
## [18,] "Asian elephant"           
## [19,] "Chinchilla"               
## [20,] "Grey seal"                
## [21,] "Echidna"                  
## [22,] "Cat"                      
## [23,] "Slow loris"               
## [24,] "Goat"                     
## [25,] "Sheep"                    
## [26,] "E. American mole"         
## [27,] "Star-nosed mole"          
## [28,] "Donkey"                   
## [29,] "Grey wolf"                
## [30,] "Rock hyrax-b"             
## [31,] "Little brown bat"         
## [32,] "Phalanger"                
## [33,] "Okapi"                    
## [34,] "African elephant"         
## [35,] "Lesser short-tailed shrew"
## [36,] "Gorilla"                  
## [37,] "Tree hyrax"               
## [38,] "Mouse"                    
## [39,] "African giant pouched rat"
## [40,] "Mountain beaver"          
## [41,] "Giraffe"                  
## [42,] "Rabbit"                   
## [43,] "Arctic ground squirrel"   
## [44,] "Horse"                    
## [45,] "Yellow-bellied marmot"    
## [46,] "Guinea pig"               
## [47,] "Big brown bat"            
## [48,] "Rat"                      
## [49,] "Jaguar"                   
## [50,] "Golden hamster"           
## [51,] "N.A. opossum"             
## [52,] "Nine-banded armadillo"    
## [53,] "European hedgehog"        
## [54,] "Cow"                      
## [55,] "Kangaroo"                 
## [56,] "Desert hedgehog"          
## [57,] "Giant armadillo"          
## [58,] "Brazilian tapir"          
## [59,] "Pig"                      
## [60,] "Musk shrew"               
## [61,] "Tenrec"                   
## [62,] "Water opossum"

Dataset: MASS::Animals

Goal

Explore the data

data(Animals, package = "MASS")
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0
str(Animals)
## 'data.frame':    28 obs. of  2 variables:
##  $ body : num  1.35 465 36.33 27.66 1.04 ...
##  $ brain: num  8.1 423 119.5 115 5.5 ...
plot(brain ~ body, data = Animals)

plot(log(brain, 10) ~ log(body, 10), data = Animals, pch = ".")
text(log(Animals$body, 10), log(Animals$brain, 10),
     labels = rownames(Animals), cex = 0.8)

Reshaping the data

Animals$log10_brain <- log(Animals$brain, base = 10)
Animals$log10_body <-  log(Animals$body, base = 10)

Fitting the model

mod <- lm(log10_brain ~ log10_body, data = Animals)
plot(mod)

influence.measures(mod)
## Influence measures of
##   lm(formula = log10_brain ~ log10_body, data = Animals) :
## 
##                    dfb.1_ dfb.l10_    dffit cov.r   cook.d    hat inf
## Mountain beaver  -0.10905  0.07464 -0.10914 1.144 6.15e-03 0.0671    
## Cow               0.01387  0.03645  0.06760 1.131 2.37e-03 0.0504    
## Grey wolf         0.04125 -0.00271  0.05619 1.114 1.64e-03 0.0358    
## Goat              0.05386 -0.00834  0.06889 1.111 2.46e-03 0.0362    
## Guinea pig       -0.16207  0.11505 -0.16207 1.135 1.35e-02 0.0720    
## Dipliodocus       0.19057 -0.76224 -0.91399 0.782 3.47e-01 0.1173   *
## Asian elephant   -0.02278  0.29807  0.40291 1.015 7.85e-02 0.0789    
## Donkey            0.04730  0.04451  0.12131 1.098 7.55e-03 0.0413    
## Horse             0.02354  0.07122  0.12783 1.114 8.40e-03 0.0518    
## Potar monkey      0.13134 -0.05296  0.14365 1.086 1.05e-02 0.0413    
## Cat               0.01461 -0.00850  0.01487 1.142 1.15e-04 0.0530    
## Giraffe           0.02409  0.07430  0.13279 1.112 9.05e-03 0.0520    
## Gorilla           0.04094  0.04320  0.11117 1.104 6.35e-03 0.0421    
## Human             0.21767  0.03310  0.34600 0.882 5.52e-02 0.0360    
## African elephant -0.06446  0.32550  0.40414 1.076 8.03e-02 0.1017    
## Triceratops       0.14087 -0.60944 -0.73995 0.881 2.42e-01 0.1110    
## Rhesus monkey     0.23101 -0.10936  0.24421 1.025 2.95e-02 0.0447    
## Kangaroo         -0.02731  0.00215 -0.03686 1.119 7.05e-04 0.0358    
## Golden hamster   -0.39051  0.33844 -0.39973 1.135 7.96e-02 0.1261    
## Mouse            -0.52481  0.49630 -0.55286 1.193 1.51e-01 0.1840    
## Rabbit           -0.08303  0.05118 -0.08382 1.136 3.64e-03 0.0569    
## Sheep             0.05082  0.00515  0.07795 1.107 3.14e-03 0.0359    
## Jaguar            0.01475  0.00615  0.02800 1.122 4.07e-04 0.0375    
## Chimpanzee        0.13396  0.00995  0.20160 1.030 2.02e-02 0.0358    
## Rat              -0.29329  0.23887 -0.29631 1.133 4.43e-02 0.1020    
## Brachiosaurus     0.40128 -1.07770 -1.19872 0.849 5.97e-01 0.1863   *
## Mole             -0.10487  0.09079 -0.10732 1.229 5.97e-03 0.1256    
## Pig               0.00159  0.00154  0.00413 1.128 8.87e-06 0.0415
plot(mod, which = 4)

summary(mod)
## 
## Call:
## lm(formula = log10_brain ~ log10_body, data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4284 -0.2937  0.1440  0.3755  1.1220 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.10958    0.17942   6.184 1.53e-06 ***
## log10_body   0.49599    0.07817   6.345 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6652 on 26 degrees of freedom
## Multiple R-squared:  0.6076, Adjusted R-squared:  0.5925 
## F-statistic: 40.26 on 1 and 26 DF,  p-value: 1.017e-06
Animals_nodino <- Animals[!rownames(Animals) %in%
                          c("Triceratops", "Brachiosaurus", "Dipliodocus"), ]
mod_nodino <- lm(log10_brain ~ log10_body, data = Animals_nodino)
plot(mod_nodino)

influence.measures(mod_nodino)
## Influence measures of
##   lm(formula = log10_brain ~ log10_body, data = Animals_nodino) :
## 
##                    dfb.1_  dfb.l10_    dffit cov.r   cook.d    hat inf
## Mountain beaver  -0.10881  0.071003 -0.10895 1.158 6.16e-03 0.0695    
## Cow              -0.01015 -0.212504 -0.30351 1.077 4.59e-02 0.0785    
## Grey wolf        -0.01178 -0.003483 -0.01990 1.139 2.07e-04 0.0413    
## Goat              0.01800  0.002504  0.02737 1.137 3.91e-04 0.0403    
## Guinea pig       -0.19216  0.131714 -0.19216 1.135 1.89e-02 0.0754    
## Asian elephant   -0.03879  0.182189  0.21825 1.224 2.45e-02 0.1320    
## Donkey           -0.00337 -0.009563 -0.01677 1.161 1.47e-04 0.0593    
## Horse            -0.00243 -0.111533 -0.15646 1.160 1.26e-02 0.0813    
## Potar monkey      0.22103 -0.057315  0.25724 0.999 3.24e-02 0.0421    
## Cat               0.06210 -0.031964  0.06390 1.147 2.13e-03 0.0533    
## Giraffe          -0.00192 -0.104180 -0.14581 1.165 1.10e-02 0.0817    
## Gorilla          -0.00996 -0.032457 -0.05526 1.159 1.59e-03 0.0611    
## Human             0.32062  0.228866  0.69981 0.526 1.74e-01 0.0448   *
## African elephant  0.02117 -0.072423 -0.08265 1.317 3.57e-03 0.1723   *
## Rhesus monkey     0.49177 -0.178325  0.53979 0.697 1.19e-01 0.0449   *
## Kangaroo         -0.14080 -0.038372 -0.23433 1.018 2.71e-02 0.0411    
## Golden hamster   -0.32757  0.288238 -0.33865 1.204 5.82e-02 0.1452    
## Mouse            -0.17546  0.170672 -0.18836 1.392 1.84e-02 0.2235   *
## Rabbit           -0.11782  0.066261 -0.11964 1.136 7.40e-03 0.0577    
## Sheep            -0.00100 -0.000613 -0.00206 1.143 2.23e-06 0.0439    
## Jaguar           -0.06117 -0.079749 -0.17930 1.089 1.63e-02 0.0499    
## Chimpanzee        0.14753  0.082366  0.29419 0.968 4.16e-02 0.0434    
## Rat              -0.28176  0.230127 -0.28604 1.164 4.16e-02 0.1134    
## Mole              0.31175 -0.273969  0.32217 1.209 5.28e-02 0.1445    
## Pig              -0.06495 -0.190697 -0.33188 0.999 5.34e-02 0.0597
plot(mod_nodino, which = 4)

summary(mod_nodino)
## 
## Call:
## lm(formula = log10_brain ~ log10_body, data = Animals_nodino)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39628 -0.20636 -0.06760  0.08427  0.83832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.93391    0.08712   10.72 2.03e-10 ***
## log10_body   0.75226    0.04572   16.45 3.24e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3152 on 23 degrees of freedom
## Multiple R-squared:  0.9217, Adjusted R-squared:  0.9183 
## F-statistic: 270.7 on 1 and 23 DF,  p-value: 3.243e-14

Plot

plot(log10_brain ~ log10_body, data = Animals, type = "n")
text(x = Animals$log10_body, y = Animals$log10_brain,
     labels = rownames(Animals), cex = 0.8,
     col = ifelse( rownames(Animals) %in% c("Triceratops", "Brachiosaurus", "Dipliodocus"), "red", "black"))
p <- predict(mod, newdata = data.frame("log10_body" = range(log(Animals$body, 10))))
p_nodino <- predict(mod_nodino, newdata = data.frame("log10_body" = range(log(Animals$body, 10))))
points(x = range(Animals$log10_body), y = p, col = "red", type = "l", lty = 2)
points(x = range(Animals$log10_body), y = p_nodino, col = "blue", type = "l")